Octave is a free, open-source, high-level interpreted language that is primarily intended for numerical computation and is mostly compatible with MATLAB. Octave is an excellent tool for rapid research and development of new algorithms, as well as for simulations and data analysis. A mathematical software developer can often prototype a new algorithm in Octave two to three times faster than in a compiled programming language such as C or C++. Unlike MATLAB, Octave is free both as in beer and as in speech. Anyone can download Octave and run an Octave program at no cost on the three major computing platforms: MS Windows, Mac OS X, and other forms of the Unix operating system. Because Octave is open source, there is much less concern that the vendor will suddenly cease support, as Microsoft did with Visual FoxPro, or redesign the language into something unusable in order to sell yet another “upgrade.” End users can always build the language from source and create a development “fork” that preserves compatibility with existing code and the elegance of the original language.
The Problem
A major problem with Octave and many other scripting languages is that programs are distributed as interpreted, human-readable source code. Potential and actual customers and other third parties can see in detail what is being done. It is easy to reverse engineer or steal programs and algorithms written in scripting languages such as Octave.
Imagine that you are a small company operating on a shoestring budget in a loft in West Hollywood that has developed a breakthrough video special effect in Octave. You want to win a contract from a Hollywood movie studio to do the effect in the next blockbuster science fiction movie starring Angelina Jolie and Brad Pitt as quarreling lovers caught in an alien invasion. The famous Hollywood movie studio wants to evaluate the algorithm in-house to make sure you are not cheating with Photoshop on the glamor shot of Angelina in a skin-tight black leather jumpsuit that they sent you. The problem is that the famous Hollywood studio that you are pitching to would steal your algorithm in a microsecond if they could. You are confronted with the cost, time, and general difficulty of converting your hot new video special effect algorithm into a compiled language such as C or C++. Meanwhile your competitors at Really Cool FX in Pasadena may come out with the same algorithm while you are struggling to convert it to C or C++.
You could be a quantitative finance wizard operating out of a poorly ventilated office in Jersey City, New Jersey with a spectacular view of scenic downtown Jersey City visible through your tiny west-facing window. You would like to sell your hot new nanosecond trading algorithm to a Too Big To Fail bank so you can move to a plush, well-ventilated corner office across the Hudson River in New York City’s financial district, but the bank insists they must thoroughly evaluate the algorithm in-house. Probably enough said right there.
You might be an idealistic junior faculty member at a prestigious, but very low paying major research university in San Francisco. You have developed the breakthrough algorithm in quantitative biology that will cure cancer — in Octave. Now, you are completely above crass materialistic concerns and plan to follow the illustrious example of Jonas Salk in refusing to patent the polio vaccine :-), donate regularly to the Free Software Foundation, and have an autographed poster of Richard Stallman in your tiny cramped office, but nonetheless you would like to get tenure and move out of your landlady’s attic. You know full well that the eminent full professor down the hall who got passed over for last year’s Nobel Prize would steal your idea in a picosecond if he could; it is common knowledge in the department that his didn’t-quite-get-the-Nobel-Prize work was actually stolen from his former graduate student who is now driving a taxicab in New York City. How do you demonstrate your breakthrough algorithm without giving away the secret and get tenure?
The Solution
Fortunately, one can obfuscate Octave code, removing nearly all human-readable information, much as a compiler does when it translates a program written in C or C++ into a machine-readable binary executable. This raises the bar for stealing your ideas and algorithms considerably. In general, code obfuscation removes all comments, indentation, and other formatting that clarifies what is going on, and replaces all human-readable variable and function names with random strings of characters that convey no meaning to a human reader. Note that the human-readable information is completely removed from the obfuscated code. By contrast, some schemes to protect programs written in scripting languages use encryption. The program is encrypted, but if someone can find or determine the encryption key, they can recover the entire original program including comments, human-readable names, and so forth.
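As a rough illustration of the first step described above (removing comments and indentation), here is a minimal sketch in Octave. It is not the obfuscation function used for the example in this article; the helper name strip_comment is hypothetical. It assumes that the code does not use the single quote as a transpose operator and does not use doubled quotes inside strings. The one subtlety it does handle is that a % inside a string such as "myflag2 is %d\n" is not a comment.

```octave
1;  % mark this file as a script so the helper below can be defined inline

% strip_comment (hypothetical helper): return one line of Octave code
% without its trailing comment and without leading/trailing whitespace.
function out = strip_comment(line)
  quote = '';            % quote character of the string we are inside, if any
  cut = numel(line);     % index of the last character to keep
  k = 1;
  while k <= numel(line)
    c = line(k);
    if isempty(quote)
      if c == '%' || c == '#'
        cut = k - 1;     % a comment starts here, outside any string
        break;
      elseif c == '"' || c == ''''
        quote = c;       % entering a string literal
      end
    elseif c == '\' && quote == '"'
      k = k + 1;         % skip the escaped character in a "..." string
    elseif c == quote
      quote = '';        % leaving the string literal
    end
    k = k + 1;
  end
  out = strtrim(line(1:cut));
end

% Usage: clean a few lines and drop those that were only comments.
src = {'% test script', ...
       'disp(''hello world''); % test comment', ...
       '  printf("myflag2 is %d\n", myflag2);  % show it'};
clean = cellfun(@strip_comment, src, 'UniformOutput', false);
clean = clean(~cellfun(@isempty, clean));
disp(strjoin(clean, ' ; '));
```

Joining the surviving lines with separators, as in the last line, also discards the original line structure.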
A Simple Example
This is a simple script in Octave.
mytest.m
% test script
disp('hello world'); % test comment
myflag = 1;
printf(\
"this is a \
test\n");
fflush(stdout);
myflag = myflag + 1;
myflag2 = myflag++;
printf("myflag2 is %d\n", myflag2);
fflush(stdout);
if flag > 1
  disp('hi');
else
  disp('no');
end
for counter = 1:10
  disp(counter); % test
end
pivalue = pi;
disp(pivalue)
disp('ALL DONE');
This script generates the following output under Octave 3.2.4 running on a Windows XP Service Pack 2 PC:
octave-3.2.4.exe:18> mytest
hello world
this is a test
myflag2 is 2
no
1
2
3
4
5
6
7
8
9
10
3.1416
ALL DONE
Here is an obfuscated version of the same Octave script generated by an obfuscation function written by the author in Octave:
mytest_obfuscated.m
disp ( 'hello world' ); ; UQWSKDTZQWRO=1 ; ; printf ( "this is a test\n" ); ; fflush ( stdout ); ; UQWSKDTZQWRO=UQWSKDTZQWRO+1 ; ; BSJRZMSBRYXD=UQWSKDTZQWRO++; ; printf ( "myflag2 is %d\n" , BSJRZMSBRYXD ); ; fflush ( stdout ); ; if flag>1 ; disp ( 'hi' ); ; else ; disp ( 'no' ); ; end ; for RBVZQAHJSNWB=1:10 ; disp ( RBVZQAHJSNWB ); ; end ; VIENISLJPENX=pi ; ; disp ( VIENISLJPENX ) ; disp ( 'ALL DONE' ); ;
Note: On a Windows PC using Firefox, one can select the obfuscated code above by selecting the first few characters at the start of the line above (e.g. disp) and then hitting Shift-End on the keyboard. Then copy and paste to Octave to run the obfuscated code.
This script generates the following output (the same as the original script) under Octave 3.2.4 running on a Windows XP Service Pack 2 PC:
octave-3.2.4.exe:22> mytest_obfuscated
hello world
this is a test
myflag2 is 2
no
1
2
3
4
5
6
7
8
9
10
3.1416
ALL DONE
Note that the reserved keywords such as “if” and built-in Octave functions such as “printf” are not obfuscated. It is actually possible to make the obfuscated code even more unreadable than the example above. This is intended as a simple illustration. The obstacles to reverse engineering and theft introduced by code obfuscation are greater for longer programs and more complex algorithms.
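The renaming step, including the rule that keywords and built-in functions are left alone, can be sketched roughly as follows. This is an illustration under stated assumptions, not the obfuscation function used above. The helper name rename_identifiers and the map structure are hypothetical, and using iskeyword and exist to decide what must not be renamed is an assumption about one workable approach. String literals are skipped with a simple pattern that does not handle escaped quotes or the transpose operator.

```octave
1;  % mark this file as a script so the helper below can be defined inline

% rename_identifiers (hypothetical helper): replace user-defined names in
% one line of code with random 12-letter names. "map" is a struct that
% remembers the new name chosen for each original name across calls.
function [out, map] = rename_identifiers(line, map)
  % Separate string literals from the code between them, so that text
  % inside quotes is never rewritten.
  [strs, code] = regexp(line, '"[^"]*"|''[^'']*''', 'match', 'split');
  for k = 1:numel(code)
    names = unique(regexp(code{k}, '\b[A-Za-z_]\w*\b', 'match'));
    for n = 1:numel(names)
      name = names{n};
      % Leave reserved keywords and known functions (built-in or on the
      % load path) untouched.
      if iskeyword(name) || exist(name, 'builtin') || exist(name, 'file')
        continue;
      end
      if ~isfield(map, name)
        % 12 random capital letters; name collisions are not checked here.
        map.(name) = char('A' + randi(26, 1, 12) - 1);
      end
      code{k} = regexprep(code{k}, ['\b' name '\b'], map.(name));
    end
  end
  % Stitch the pieces back together: code, string, code, string, ...
  out = code{1};
  for k = 1:numel(strs)
    out = [out strs{k} code{k + 1}];
  end
end

% Usage: the same map is passed from line to line so names stay consistent.
map = struct();
[l1, map] = rename_identifiers('myflag = myflag + 1;', map);
[l2, map] = rename_identifiers('printf("myflag2 is %d\n", myflag2);', map);
disp(l1);
disp(l2);
```

Under this rule a name such as flag, which Octave already defines as a function, is treated like printf and left unchanged.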
Conclusion
A major problem with Octave and other scripting languages is that it is easy for potential or actual customers or other third parties to reverse engineer or steal algorithms or other sensitive information from a program written in a human-readable scripting language. This can be a serious problem for algorithm developers using Octave. It is much less of a problem with compiled languages such as C or C++, in which, however, it is usually slower and more costly to develop algorithms than in Octave. Compilers generate unreadable binary files that are difficult, though not impossible, to reverse engineer.
Computer programs can obfuscate Octave code, automatically removing human readable information such as comments, variable and function names, indentations, and so forth. This is very close to the same information that is removed by compilers when they convert a program written in a compiled programming language such as C or C++ to a binary executable. In some ways, this is more secure than encrypting the code since the information is actually removed entirely from the obfuscated code; the encryption can be broken, often by simply stealing the encryption key. Code obfuscation raises the bar substantially for reverse engineering or stealing an algorithm or other critical intellectual property implemented in Octave. The same comments apply to other scripting languages such as Python, Perl, and Ruby.
© 2011 John F. McGowan
About the Author
John F. McGowan, Ph.D. solves problems by developing complex algorithms that embody advanced mathematical and logical concepts, including video compression and speech recognition technologies. He has extensive experience developing software in C, C++, Visual Basic, Mathematica, MATLAB, and many other programming languages. He is probably best known for his AVI Overview, an Internet FAQ (Frequently Asked Questions) on the Microsoft AVI (Audio Video Interleave) file format. He has worked as a contractor at NASA Ames Research Center involved in the research and development of image and video processing algorithms and technology. He has published articles on the origin and evolution of life, the exploration of Mars (anticipating the discovery of methane on Mars), and cheap access to space. He has a Ph.D. in physics from the University of Illinois at Urbana-Champaign and a B.S. in physics from the California Institute of Technology (Caltech). He can be reached at jmcgowan11@earthlink.net.
I do not think obfuscation is the way to protect your code as long as the ‘obfuscated encoding’ is reversible and Octave's own decoding code is also open. An authorized IP core (zipped or encrypted using an SN code) might be the final way.
But as long as I am at the university, I just use Octave to prototype an algorithm and analyze its performance. The final product is always reimplemented in Fortran or C, packaged with a simple GUI, and sent to the customer.
The obfuscated code is not reversible in the sense that the human readable names and similar information cannot be recovered from the obfuscated code. It is not encryption. The obfuscated variable and function names such as UQWSKDTZQWRO are randomly generated strings of characters, not encrypted versions of the original variable and function names. Similarly the comments are completely stripped out.
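The difference can be seen in a small sketch. The letter shift below is only a stand-in for a reversible encoding and is not a scheme anyone in this discussion proposed. The random name is generated the same way a 12-letter name such as UQWSKDTZQWRO could be, although the generator shown is an assumption and not necessarily the one used for the example.

```octave
name = 'myflag';

% Reversible encoding: every letter is shifted by 3 places, so anyone who
% knows (or guesses) the shift can undo it and recover the original name.
shifted   = char(mod(upper(name) - 'A' + 3, 26) + 'A');
recovered = char(mod(shifted - 'A' - 3, 26) + 'A');
printf("%s -> %s -> %s\n", upper(name), shifted, recovered);

% Random replacement: the new name is drawn without looking at the old
% one, so there is nothing in it to decode.
random_name = char('A' + randi(26, 1, 12) - 1);
printf("%s -> %s\n", name, random_name);
```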
Of course, one can work through the obfuscated code and with difficulty figure out what it does. This is also true of binary executables generated by compiling C, Fortran, or other compiled languages; in this case, one needs to run a disassembler on the binary executable.
It is generally faster to obfuscate code than to convert it to C or Fortran or a similar language by hand. The relative merits of obfuscation versus conversion to C/Fortran/another compiled language depend on the situation.
Sincerely,
John
I understand the need for obfuscation, but I’m sorry to say that I see little point in this approach. Yes, the lack of comments, and the long and meaningless variable names make this code hard to read. However, the names can be replaced with ‘a’, ‘b’ etc. for slightly better readability. Complicated algorithms will still be hard to understand without comments, but a trained eye can do this in a few days. And in the high-stake cases like the examples you give, there will be enough motivation to try.
Python code can be shipped as byte-code (.pyc file), which provides better protection than obfuscation. Alternatively, the script can be converted to machine code with Cython. Octave developers may also need to provide such solutions.
On the other hand, if a company wants to examine the algorithm, they may want to see the algorithm itself, not just a black box executable. In that case there will be no protection against theft.
In the academic case, I believe that some secrecy, followed by a conference presentation and a paper will be enough to establish priority.
At the end, it matters very little.., mainly if you can have an “Apple” patenting yourself touching your own nose…
br
Beer is not free, speech is not free (protected to an extent, but not “free to do anything.”) Join the fight against nonsensical colloquialisms; say NO to the Free and Open Source “Writing Community”!
Amusingly enough, I tagged this article on the search of Octave “Reverse Engineering” Matlab: As many of Matlab’s functions are not obfuscated in any real form; by mimicking matlab’s core-language, they could practically copy various functions over and just “messy up” the code. Even by looking at the open source component of MatLab they could easily be committing Dirty Reverse Engineering.
Renaming variables is for kiddies. Using gotos, low level versions of various commands, having redundant variables (that have actual calculations being performed on, and combined them with the main program when they == 1 or some other constant)… actually MAKING the code illegible. The bigger the mess you create, the more a potential thief will have to comb through.
Thing is… that type of obfuscation could be done by a personal algorithm; and be made very very complex in how much it sullies the code up… yet people always advise the “does nothing” type… and (for lack of trying) I have not come across much anything that does real obfuscation of script code.