This article discusses the scope of mathematical programming projects, using several successful free, open-source software projects as examples. In the author’s experience, it is common to encounter rather optimistic expectations of the cost, schedule, and risk of mathematical research and development and of mathematical programming projects. The two categories are heavily blurred together today, especially for practical and applied projects. This article provides a rough picture of the scope of mathematical programming projects based on actual historical data rather than popular culture, anecdote, or folk wisdom.
There are strong practical reasons for pursuing applied mathematical research and development and programming projects. Many potentially beneficial projects exist. These projects often suffer from the “cure for cancer” problem. With several hundred thousand people each year in the United States alone succumbing to cancer, there is little question that there is a large market for a cure for cancer. The problem is that we do not know how to cure cancer. Similarly, successful mathematical research and development and programming projects offer everything from profitable investment advice to speech recognition for mobile devices and household appliances to working fusion reactors and other new energy sources. Indeed, a cure for cancer is something that mathematical methods may offer in the future through molecular modeling or other quantitative approaches. Given the huge potential markets for successful mathematical projects, it is common to encounter individuals, organizations, and companies with great interest in particular, usually practical, mathematical projects. These projects are highly unlikely to succeed without an accurate idea of their scope.
The Scope of Some Successful Free Open-Source Mathematical Software Projects
Program | Lines of Code | Core Lines of Code | Calendar Duration | Number of Contributors |
FFMPEG 0.6.1 Video Encoder | 373,742 | 368,457 | at least 2004-2011 | 50
x264 h.264 Video Encoder (x264-snapshot-20110204-2245) | 67,986 | 62,968 | at least 2004-2011 | 18
Independent JPEG Group JPEG encoder/decoder v8c | 61,102 | 52,304 | at least 2000-2011 | 13
Open CV 2.2.0 Computer Vision Library | 884,808 | 396,399 | at least 1999-2011 | 80
Insight Toolkit 3.20.0 Image Segmentation and Registration Toolkit | 698,143 | 685,466 | at least 1999-2011 | at least 14
Pythia/Lund Monte Carlo 8.145 Particle Physics Event Simulation (C++ version) | 141,353 | 46,258 | 1977-2011 | 5
Pythia/Lund Monte Carlo 6.327 Particle Physics Event Simulation (last FORTRAN Version) | 60,455 | 60,455 | 1977-1996 | 5
EGS (Electron Gamma Shower) | 38,921 | 31,151 | 1950’s-2011 | unknown
LAPACK 3.3.0 Linear Algebra Library | 459,993 | 458,645 | at least 1970’s to present | many contributors (probably over 100)
AESCRYPT Encryption/Decryption Utility | 4,331 | 4,286 | 2001-2009 | at least 2
GNU Privacy Guard (GNUpg) v. 1.4.11 | 148,374 | 120,441 | at least 1998 to 2008 | 47
Octave 3.2.4 Numerical Programming Tool | 539,233 | 453,160 | at least 1980s to present | many contributors (probably over 100)
Notes
The free, open-source CLOC (Count Lines of Code) utility was used to count the number of lines of code in each project. CLOC lists the number of lines of code in each programming language in the project, such as C, C++, Bourne Shell, HTML, and so forth. CLOC does not count blank lines or comment lines. Some projects include sizable amounts of installation code (in the Unix Bourne Shell, for example), HTML documentation, and so forth, all of which are counted in the total number of lines of code reported by CLOC. The actual mathematical code is typically implemented in a few languages such as C, C++, FORTRAN, or MATLAB. The term “Core Lines of Code” refers to the lines of code in these languages, as reported by CLOC, which are presumed to contain the actual mathematical software.
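The split between total and core lines of code described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used for the table: it assumes CLOC was run with its `--csv` option, the sample data is made up, and the set of "core" languages (including whether C/C++ headers count) is an assumption, since the article does not list the exact languages used for each project.

```python
import csv
import io

# Languages treated as "core" mathematical code. This set is an assumption:
# the names follow CLOC's usual labels, and whether C/C++ headers were
# counted as core in the article is not stated.
CORE_LANGUAGES = {"C", "C++", "C/C++ Header", "Fortran 77", "Fortran 90", "MATLAB"}


def summarize_cloc_csv(csv_text):
    """Return (total_lines, core_lines) from CLOC-style CSV text."""
    total = 0
    core = 0
    for row in csv.DictReader(io.StringIO(csv_text.strip())):
        language = (row.get("language") or "").strip()
        if not language or language == "SUM":
            continue  # skip CLOC's summary row so nothing is counted twice
        code = int(row["code"])
        total += code
        if language in CORE_LANGUAGES:
            core += code
    return total, core


# Made-up sample in the column layout produced by `cloc --csv`;
# these are not the counts of any project in the table.
SAMPLE = """files,language,blank,comment,code
120,C,4000,6000,30000
45,C/C++ Header,900,1500,5000
10,Bourne Shell,300,200,2500
8,HTML,50,0,1200
183,SUM,5250,7700,38700
"""

if __name__ == "__main__":
    total, core = summarize_cloc_csv(SAMPLE)
    print(f"total lines of code: {total:,}")
    print(f"core lines of code:  {core:,}")
```

In this made-up sample the shell and HTML lines are included in the total but excluded from the core count, which is the distinction the two columns of the table are meant to capture.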
In general, open source projects provide a wealth of detailed information that is difficult or impossible to acquire for many commercial proprietary projects. In particular, one can see the source code, count the lines of code or other measures of size and scope, and often read comments, change logs, logs of version control systems, and so forth. Nearly all open source projects give a list of contributors somewhere in the documentation and provide rough information on the calendar duration of the project. There is usually precise information on releases and release dates. Unfortunately, it is difficult to get a reasonably exact measure of the actual effort expended on the project. Most open source projects do not publish information on exact hours worked or dollars expended, even if such records exist. Several of the examples were fully or partially funded either by government funding agencies (e.g. the National Library of Medicine for the Insight Toolkit) or private sources (e.g. Intel for OpenCV), so such detailed information may be available in some cases.
The Examples
The examples were chosen as successful free open-source projects that are widely used within their field or application and whose quality is comparable or superior to that of good commercial software products. Several, such as FFMPEG and x264, are highly applied and used in the everyday world. Several, such as the Pythia/Lund Monte Carlo, are primarily scientific research tools. Some, such as Octave and LAPACK, span both worlds.
FFMPEG is a widely used open source audio/video encoding utility and collection of libraries. FFMPEG can encode and decode a wide range of different audio and video formats and compression schemes including h.264. It incorporates a number of other utilities and libraries. x264 is a widely used open source h.264 video encoder. The Independent JPEG Group distributes a widely used open source JPEG image encoder and decoder. OpenCV is a widely used computer vision library incorporating many of the current state-of-the-art computer vision algorithms; it is used in research and in a few commercial products. The Insight Toolkit is a toolkit of image segmentation and registration algorithms, somewhat similar to OpenCV in practice, geared toward medical imaging.
The Pythia/Lund Monte Carlo is a widely used program for simulating the formation of jets of subatomic particles and other processes in experimental and theoretical particle physics, for example at the Large Hadron Collider (LHC) at CERN. Two versions, the original FORTRAN version and the more recent rewrite in C++, are listed. Electron Gamma Shower or EGS is a widely used program for simulating the interactions of electrons and photons (gamma rays and x-rays) with matter. It was originally developed for nuclear and particle physics at the Stanford Linear Accelerator Center (SLAC), but is now widely used for medical radiation studies. LAPACK is a widely used FORTRAN library of linear algebra and other basic numerical algorithms; it is often found in other programs as well. AESCRYPT is a free open-source implementation of the Advanced Encryption Standard (AES) for data encryption. GNU Privacy Guard (GNUpg) is a free, open-source implementation of the OpenPGP encryption standard. Octave is a free, open-source numerical programming tool that is mostly compatible with MATLAB. Octave has been discussed in previous articles by this author, starting with “Octave: An Alternative to the High Cost of MATLAB”.
Actual Effort Estimation with Basic COCOMO
The Constructive Cost Model (COCOMO) is a software cost estimation model developed by Barry Boehm. Basic COCOMO is the original, very simple cost estimation model published by Boehm in his 1981 book Software Engineering Economics. It gives a simple, crude estimate of the effort in man-months as a power-law function of the number of lines of code in a project: effort = a × (KLOC)^b, where KLOC is thousands of lines of code and a = 2.4, b = 1.05 for the “organic” model. The following table gives the estimated effort in man-months and man-years from applying the “organic” Basic COCOMO model to the number of lines of code in each mathematical open source project in this article:
Program | Basic COCOMO Man-Months | Basic COCOMO Man-Years |
FFMPEG 0.6.1 | 1,204 | 100 |
x264 | 201.5 | 16.75 |
IJG v8c | 179.8 | 15 |
Open CV 2.2.0 | 2,982 | 248.5 |
Insight Toolkit | 2,324 | 193.7 |
Pythia/Lund 8.145 | 443 | 36 |
Pythia/Lund 6.327 | 178 | 14.8 |
EGS | 112 | 9.3 |
LAPACK 3.3.0 | 1,637 | 136.4 |
AESCRYPT | 11 | 0.9 |
GNU Privacy Guard (GNUpg) 1.4.11 | 456 | 38 |
Octave 3.2.4 | 1,771 | 147.6 |
The following Octave/MATLAB function was used to compute the estimated man-months using the Basic COCOMO “organic” model:
function [man_months, dev_time, people_required] = cocomo(kloc, type)
% [man_months, dev_time, people_required] = cocomo(kloc [, type])
%
% kloc (thousands of lines of code)
% type (type of project: 'organic', 'semi' (semi-detached), or 'embedded')
%
  if nargin < 2
    type = 'organic';  % default project type
  end
  c = 2.5;  % schedule coefficient, the same for every project type
  if strcmp(type, 'organic')
    a = 2.4; b = 1.05; d = 0.38;
  end
  if strcmp(type, 'semi')  % semi-detached
    a = 3.0; b = 1.12; d = 0.35;
  end
  if strcmp(type, 'embedded')
    a = 3.6; b = 1.2; d = 0.32;
  end
  man_months = a*(kloc)^b;                  % estimated effort in man-months
  dev_time = c*(man_months)^d;              % estimated calendar time in months
  people_required = man_months / dev_time;  % average number of people
end
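As a rough cross-check, the same three equations can be re-applied to a few rows of the tables. The sketch below is a Python transcription of the Octave function above, not the author's code; it assumes the man-month figures were computed from the total (not core) line counts, which appears to hold for the three rows chosen here but has not been checked for every row.

```python
# Basic COCOMO coefficients (a, b, d) per project class, copied from the
# Octave function above; c = 2.5 is shared by all three classes.
COEFFICIENTS = {
    "organic": (2.4, 1.05, 0.38),
    "semi": (3.0, 1.12, 0.35),
    "embedded": (3.6, 1.20, 0.32),
}
C = 2.5


def cocomo(kloc, kind="organic"):
    """Return (man_months, dev_time_months, people_required)."""
    a, b, d = COEFFICIENTS[kind]
    man_months = a * kloc ** b        # estimated effort
    dev_time = C * man_months ** d    # estimated calendar time in months
    return man_months, dev_time, man_months / dev_time


# (total lines of code from the first table, man-months from the second).
PUBLISHED = {
    "x264": (67_986, 201.5),
    "Open CV 2.2.0": (884_808, 2_982),
    "AESCRYPT": (4_331, 11),
}

if __name__ == "__main__":
    for name, (lines, published) in PUBLISHED.items():
        mm, months, people = cocomo(lines / 1000.0)
        print(f"{name}: {mm:.1f} man-months (table: {published}), "
              f"{months:.1f} calendar months, {people:.1f} people")
```

For these three rows the recomputed effort should land within a couple of percent of the published man-months. The calendar time and staffing columns come from the same model and are just as crude as the effort estimate.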
Conclusion
While this data sample is clearly limited and a larger study is desirable, it should nonetheless be evident that successful mathematical programming projects are usually substantial. Even the smallest project on the list, the AESCRYPT encryption utility, probably took several man-months to fully develop; Basic COCOMO estimates about 11 man-months, almost one man-year. Thus, expectations of a few weeks are generally unrealistic. Indeed, expectations of three calendar months, a fiscal quarter, the current fetish of American business, are usually unrealistic. On the other hand, expectations ranging from six months to several years may be realistic depending on the specific project.
In part because of heavy government funding of mathematical research and development, there are a large number of open-source, free mathematical programming projects available. This provides an excellent database of information on the size and scope of such projects, something often difficult to find for business applications where most products and projects are proprietary. Anyone considering such a mathematical project is well advised to examine comparable open source projects if they exist to determine the size and scope to the extent possible. Unfortunately, open source projects often can give only a rough measure of the actual effort (mythical man-months) used in the project. The Basic COCOMO model can provide a very rough way of estimating the actual effort of the open source project from the lines of code, but clearly a more direct way of measuring the actual effort is needed.
© 2011 John F. McGowan
About the Author
John F. McGowan, Ph.D. is a software developer, research scientist, and consultant. He works primarily in the area of complex algorithms that embody advanced mathematical and logical concepts, including speech recognition and video compression technologies. He has extensive experience developing software in C, C++, Visual Basic, Mathematica, MATLAB, and many other programming languages. He is probably best known for his AVI Overview, an Internet FAQ (Frequently Asked Questions) on the Microsoft AVI (Audio Video Interleave) file format. He has worked as a contractor at NASA Ames Research Center involved in the research and development of image and video processing algorithms and technology. He has published articles on the origin and evolution of life, the exploration of Mars (anticipating the discovery of methane on Mars), and cheap access to space. He has a Ph.D. in physics from the University of Illinois at Urbana-Champaign and a B.S. in physics from the California Institute of Technology (Caltech). He can be reached at jmcgowan11@earthlink.net.
I suggest that you add GIMPS, the Great Internet Mersenne Prime Search, hosted at https://mersenne.org. GIMPS federates the task of finding Mersenne primes, primes of the form 2^p – 1, to people who contribute their computers’ idle times to the effort.
At any time, the record for largest-known-prime is quite likely to have been discovered by a member of the GIMPS project.