Smart is not enough

Victor Rodriguez Bahena

12 Dec, 2016

Machine Learning performance with Intel® technologies

By Victor Rodriguez, Sr. Software Engineer at Intel® Corporation

Every day the world generates around 2.5 quintillion bytes of data (roughly 2.5 billion gigabytes). In fact, 90% of the world's data today was created in the last two years [1]. That huge amount of data needs to be processed and analyzed as quickly as possible so it can provide meaningful information to improve our daily lives.

Effectively dealing with big data means sifting through it to find relevant information, modeling its elements, and transforming it into useful information and knowledge. Machine learning techniques for modeling, prediction, clustering, and knowledge discovery are the preferred solution. When the complexity or volume of the data to be processed per unit of time exceeds the capacity of human operators and experts, machine learning, as part of data mining, provides methods to automatically process the data and extract information from it.

The amount of time machine learning systems take to handle these huge volumes of data is crucial. Therefore, big data initiatives should focus not only on the volume of data, but on the speed at which data is processed. Massive distributed computational components increasingly run large-scale machine learning and data analytics tasks. Modern distributed systems, like Spark*, and computational primitives, like MapReduce*, have gained significant traction because they enable big data analytics [2]. However, anomalous system behaviors and bottlenecks can significantly degrade the performance of these systems.

One of the factors influencing performance is the operating system (OS) that runs the machine learning workload. How the OS exposes and manages the hardware's capabilities has a crucial impact on application performance. As the Linux community continues to redefine the boundaries of what is possible for cloud-based Linux distributions running on new silicon, both power and performance play an increasingly important role in the industry.

The Clear Linux* Project decided to use the latest Intel® Advanced Vector Extensions (Intel® AVX) technology, making it easier to develop applications that take advantage of the new floating-point and integer operations of newer servers, starting with the Intel® architecture code-named Haswell. Intel AVX 1.0 allowed programmers and compilers to perform floating-point math in a highly parallel, single instruction, multiple data (SIMD) fashion, improving performance in compute-heavy workloads such as high performance computing and machine learning. Intel® Advanced Vector Extensions 2 (Intel® AVX2) added more operations, most notably 256-bit integer instructions. However, because applications compiled for Intel AVX2 will generally not run on processors prior to Haswell, developers often favor the older Intel AVX for compatibility. In Clear Linux we developed a key enhancement to the glibc dynamic loader that lets developers ship two versions of a library: one compiled for Intel AVX2 and one compiled for systems without it. At run time, the loader picks the version that matches the underlying platform, so application performance is tailored to it.
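Whether a given machine can use the AVX2 build of a library is visible from user space. The sketch below is not the loader's own selection logic, just an illustrative check of the kernel-reported CPU feature flags, assuming a Linux system with `/proc/cpuinfo`:

```python
def cpu_flags(path="/proc/cpuinfo"):
    """Return the set of CPU feature flags the kernel reports (Linux only)."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass  # not Linux, or the file is unreadable
    return set()

def supports_avx2():
    """True if this CPU advertises the 'avx2' feature flag."""
    return "avx2" in cpu_flags()

if __name__ == "__main__":
    print("AVX2 supported:", supports_avx2())
```

The glibc loader does this kind of capability detection itself, via hardware capabilities, so applications transparently get the fastest library build available.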

One of those libraries is OpenBLAS*. OpenBLAS is an open source implementation of BLAS (Basic Linear Algebra Subprograms), a standard interface for linear algebra routines. Replacing the default linear algebra libraries with OpenBLAS increases speed, especially for the dot function used for matrix multiplication in diverse scientific libraries.
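To make concrete what OpenBLAS accelerates: a matrix product written naively is a triple loop, and BLAS's gemm routines replace exactly this kernel with hand-tuned, vectorized code. A minimal pure-Python sketch, for illustration only:

```python
def matmul(a, b):
    """Naive O(n^3) matrix product: the kernel BLAS gemm routines optimize."""
    n, k, m = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            c[i][j] = s
    return c
```

In NumPy, `numpy.dot` dispatches this work to whatever BLAS the library was built against, which is how an AVX2-enabled OpenBLAS build speeds up unmodified Python code.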

In the Clear Linux* Project, the OpenBLAS package ships both a build for systems with Intel AVX2 support and a build for systems without it. Having two versions of OpenBLAS lets each workload run with the version best suited to the available processor capabilities.

OpenBLAS sits at the core of Scikit-learn*, a free Python framework for machine learning that extends the NumPy and SciPy packages with numerous data mining algorithms [3]. The package keeps improving by accepting valuable contributions from many sources.
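A small sketch of the chain described above, assuming Scikit-learn and NumPy are installed: fitting even a simple estimator reduces to dense linear algebra that NumPy/SciPy hand off to the underlying BLAS, which on Clear Linux is OpenBLAS.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny synthetic dataset: y = 2x, so the fitted slope should be ~2.0.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# fit() solves a least-squares problem; the heavy lifting lands in BLAS calls.
model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)  # slope ~2.0, intercept ~0.0
```

None of this code changes when the AVX2-enabled OpenBLAS is selected; the speedup comes for free from the library underneath.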

A few weeks ago, the Phoronix* magazine posted benchmark results under the title “Machine Deep Learning CPU Linux Distro Tests”. Some of the tests used to measure the performance of these operating systems, including the Clear Linux* Project, are part of a deep learning framework developed by the Berkeley Vision and Learning Center. In addition, several Python libraries for scientific computing and machine learning were used to build applications where machine learning helps solve day-to-day problems. The graph shows the significant benefits gained from enabling OpenBLAS with Intel AVX2 acceleration.

However, in today's data center world, the need to deploy solutions and services to customers in a matter of minutes forces operating system engineers to verify that a solution also applies in virtualized environments. Containers are one such environment: they run an application and its dependencies in resource-isolated processes. From a data center administrator's point of view, users and applications today typically experience an operating system's new features and performance optimizations through virtualization and container environments.

Phoronix* magazine published an article showing how various performance benchmarks run under different Docker container operating system images. The results consistently showed an outstanding performance improvement, in some cases even outperforming the bare-metal host system running the Docker container images. The performance gap comes from each image shipping its own versions of the libraries the applications need; in the case of the Clear Linux Project, these libraries were compiled to take advantage of Intel AVX2 technology.

The benefits of Intel AVX2 are not limited to these examples; many other scientific programming and analytics tools could improve as well. For example, the R* programming language, an important tool for researchers in the numerical analysis and machine learning spaces, uses OpenBLAS as its basic linear algebra library. Using OpenBLAS with Intel AVX2 support yielded a 3x performance improvement, as shown in the latest R benchmark* results on the Phoronix* website.

In a world where billions of gigabytes of data are generated and analyzed every day, we cannot afford to waste hardware resources that could be used to reduce compute time. With technology like Intel AVX-512 coming next year, the potential for performance improvements in machine learning and big data analytics seems limitless. As a software developer, I see it as our responsibility to push the limits as far as our imagination allows, because, as we have seen in today's world, smart is not enough.


  1. Matthew Wall “Big Data: Are you ready for blast-off?” Business reporter, BBC News, 4 March 2014

  2. J. L. Berral-García, "A quick view on current techniques and machine learning algorithms for big data analytics," 2016 18th International Conference on Transparent Optical Networks (ICTON), Trento, 2016, pp. 1-4. doi: 10.1109/ICTON.2016.7550517

  3. A. Jovic, K. Brkic and N. Bogunovic, "An overview of free software tools for general data mining," 2014 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, 2014, pp. 1112-1117. doi: 10.1109/MIPRO.2014.6859735

Legal disclaimer

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at, or from the OEM or retailer.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase.  For more complete information about performance and benchmark results, visit

Intel, the Intel logo and others are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.