By Victor Rodriguez

AWK is an interpreted programming language designed for text processing, and it dates back to the early days of UNIX development. The basic function of AWK is to search files for lines (or other units of text) that contain certain patterns. When a line matches one of the patterns, AWK performs the specified actions on that line. AWK continues to process input lines in this way until it reaches the end of the input files.
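As a minimal illustration of this pattern-action model, the sketch below feeds three made-up records to awk. The data and the threshold are hypothetical; the pattern `$2 > 28` selects lines, and the action `{ print $1 }` runs only on the lines that match.

```shell
# Hypothetical sample input: a name and a number on each line.
# Pattern: the second field is greater than 28.
# Action:  print the first field of the matching line.
result=$(printf '%s\n' 'alice 30' 'bob 25' 'carol 41' |
    awk '$2 > 28 { print $1 }')

echo "$result"
```

Lines that do not match the pattern (here, the `bob` record) are read and skipped without any action being taken.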

GNU/Linux distributions ship the version of AWK that is written and maintained by the Free Software Foundation (FSF), often referred to as GNU AWK (GAWK). GAWK provides a large number of extensions over POSIX awk.

AWK can handle many tasks, including text processing, producing formatted text reports, arithmetic operations, and string operations. Because AWK is used in so many areas, improving its performance pays off broadly. One of these areas is text processing for big data analysis.
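A short sketch can show several of these tasks in one program. The item names, quantities, and prices below are invented for illustration; the awk program does arithmetic on two fields, a string operation with `toupper`, and formatted output with `printf`.

```shell
# Hypothetical input: item name, quantity, unit price.
report=$(printf '%s\n' 'widget 4 2.50' 'gadget 10 1.25' |
    awk '{
             total = $2 * $3                       # arithmetic on fields
             sum += total                          # running total across lines
             printf "%-8s %6.2f\n", toupper($1), total   # string op + formatting
         }
         END { printf "%-8s %6.2f\n", "TOTAL", sum }')

echo "$report"
```

The `END` block runs once after the last input line, which makes it a natural place for report totals.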

Text processing is a major field in cloud technology. Our world is being revolutionized by data-driven methods: access to large quantities of data has generated opportunities for new insights into commerce, science, and computing applications. Processing the enormous quantities of data necessary for these advancements requires tools like AWK, and it requires versions of those tools that perform well for the workloads users actually run.

For this reason, the Clear Linux* Project team chose the Automatic Feedback-Directed Optimizer (AutoFDO) to improve the performance of AWK. AutoFDO uses sampling-based profiler data to drive feedback-directed optimizations. It gathers this data with perf, collecting sample profiles, and uses a stand-alone tool to convert the perf.data file into gcov format. The source code of the tool is publicly available.
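The profile, convert, and rebuild cycle described above can be sketched roughly as follows. This is not the Clear Linux team's actual recipe: the file names, the sample awk program, the compiler flags other than `-fauto-profile`, and the Intel-specific perf event are all assumptions. The script only prints the commands by default, because running them needs perf, the AutoFDO `create_gcov` tool, and a gawk source tree.

```shell
#!/bin/sh
# Sketch of one AutoFDO cycle for gawk. All paths and the workload are
# hypothetical. Set DRY_RUN=0 to actually execute the commands.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"          # dry run: show the command only
    else
        "$@"
    fi
}

# 1. Sample a representative workload with perf, recording branch stacks (-b).
#    The event name is an assumption and is specific to Intel CPUs.
#    The profiled binary should carry debug info (-g).
run perf record -b -e br_inst_retired.near_taken:pp -o perf.data -- \
    ./gawk '{ n += NF } END { print n }' input.txt

# 2. Convert perf.data into a gcov-format profile with the stand-alone tool.
run create_gcov --binary=./gawk --profile=perf.data \
    --gcov=gawk.afdo -gcov_version=1

# 3. Rebuild with GCC, feeding the sampled profile back to the optimizer.
run ./configure CFLAGS="-O2 -g -fauto-profile=$PWD/gawk.afdo"
run make
```

The quality of the result depends on step 1: the optimizer can only favor code paths that the sampled workload actually exercised.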

To generate a good collection of sample profiles, it is necessary to have a robust benchmark that exercises most of the code paths of the tool. There are numerous ways to measure the performance of AWK. One of our measurements used simple text processing of a 100-million-line file. After rebuilding the AWK code with AutoFDO, that measurement showed an improvement of up to 18% in execution time.
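A simple text-processing measurement of this kind might look like the sketch below. The input format, the awk program, and the default line count are assumptions for illustration, not the benchmark behind the figure quoted above; the default is kept small so the script finishes quickly, and `N_LINES` can be raised toward 100 million for a more realistic run.

```shell
# Number of input lines; hypothetical default, far below 100 million.
N_LINES="${N_LINES:-100000}"
input=$(mktemp)

# Generate synthetic input: a label, the line number, and a small numeric field.
seq 1 "$N_LINES" | awk '{ print "record", $1, $1 % 7 }' > "$input"

# Time a simple pattern-action job: count lines whose third field is 0.
start=$(date +%s)
matches=$(awk '$3 == 0 { count++ } END { print count + 0 }' "$input")
end=$(date +%s)

echo "lines=$N_LINES matches=$matches seconds=$((end - start))"
rm -f "$input"
```

Running the same script against the original and the rebuilt awk binary, several times each, gives a rough before-and-after comparison. `date +%s` has one-second resolution, so short runs need a larger `N_LINES` to show a meaningful difference.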

Summary

We think profile-based optimization techniques can benefit the entire Linux community, as well as the cloud industry, by improving the performance of their analytics systems. With these improvements to AWK, the Clear Linux* Project continues to push the performance boundaries of what is possible in a cloud-based Linux® distribution running on Intel silicon.