Data Analytics Reference Stack (DARS) v1.0 Now Available

In 2017, the average fiber to the home (FTTH) household generated 85GB of Internet traffic and is expected to generate approximately 264GB [1] of Internet traffic per month in 2022. For comparison, a smart car will generate 50GB, a smart hospital 3,000GB, a plane 40TB, and a city safety system 50PB, all in a single day [2]. And these predictions are for 2019; by 2022 there will be 3x more connected devices (28.5 billion [1]) than the global population, which means 3x more traffic. The quantity of data generated is difficult to comprehend, much less make actionable.

This exponential growth in volume and variety of data provides enterprises a tremendous opportunity to gain a competitive edge through analytics-driven insights. Those who turn the mountains of information into actionable intelligence will be positioned to make business operations more efficient, drive faster innovation, and deliver improved security.

With this goal in mind, Intel is releasing a Data Analytics Reference Stack (DARS) to help enterprises analyze, classify, recognize, and process large amounts of data. Using a modern system stack such as this, built on Intel® Xeon® Scalable platforms and featuring software optimizations at each layer, enterprise customers and developers can gain a significant performance boost, from hardware up to the application layer.

This ready-to-use stack gives application developers and architects a powerful way to store and process large amounts of data by using a distributed processing framework to efficiently build big-data solutions and solve domain-specific problems. Having a streamlined system stack frees users from the complexity of integrating multiple components and software versions, and delivers a stable, performant platform upon which to quickly develop, test, and deploy solutions. Some key attributes include:

Apache Spark* (v2.4.0), an open source, distributed, general-purpose cluster-computing framework. This helps developers and data scientists rapidly analyze and transform data at scale.

Apache Hadoop* (v3.2.0), an open source framework allowing for distributed processing of large data sets across clusters of computers using simple programming models. This framework is designed to scale up from a few servers to thousands of machines, each offering local computation and storage.

OpenJDK* 11, an open source reference implementation of version 11 of the Java* SE platform. Java is one of the most widely used programming languages for building enterprise-grade applications.
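To make the "simple programming models" mentioned above more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases that a framework such as Apache Hadoop distributes across a cluster. It uses only the Python standard library and calls no Hadoop or Spark API. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word, as a mapper would."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group values by key; a real framework does this across nodes."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, as a reducer would."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Chain the three phases on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    # Hypothetical input, for illustration only.
    sample = ["big data needs big clusters", "data at scale"]
    print(word_count(sample))
```

In a real cluster each phase runs in parallel on many machines and the shuffle moves data over the network. The sketch only shows the shape of the computation.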

The Data Analytics Reference Stack can be used as a multi-node architecture, for faster input, storage, and analysis of large data sets.


In addition to new features, this release incorporates the latest versions of developer tools and frameworks, namely:

  • Operating System: Clear Linux* OS, customized to individual development needs and optimized for Intel platforms, including specific use cases like Deep Learning.
  • Orchestration: Kubernetes* to manage and orchestrate containerized applications for multi-node clusters with Intel platform awareness.
  • Containers: Docker* containers and Kata* Containers with Intel® Virtualization Technology (Intel® VT) to help secure containers.
  • Libraries: Intel® Math Kernel Library (MKL), a highly-optimized math library for mathematical function performance and OpenBLAS*, an open source implementation of the Basic Linear Algebra Subprograms (BLAS) API for optimized implementations of linear algebra kernels.
  • Runtimes: Python* application and service execution support that is optimized for Intel Architecture (IA), and Java*, a general-purpose programming language that is class-based, object-oriented, and specifically designed to have as few implementation dependencies as possible.
  • Frameworks: Apache Spark* and Apache Hadoop*
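As a rough illustration of what a BLAS kernel computes, the sketch below implements the general matrix multiply operation (GEMM), C = alpha * A * B + beta * C, in plain Python. It is a readable reference only and does not call Intel MKL or OpenBLAS. The function name `gemm_reference` and the sample matrices are hypothetical. Optimized libraries compute the same arithmetic with heavily tuned native code.

```python
def gemm_reference(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for matrices given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("columns of a must match rows of b")
    if len(c) != rows or any(len(row) != cols for row in c):
        raise ValueError("c must have shape rows(a) x cols(b)")
    result = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            # Dot product of row i of a with column j of b.
            dot = sum(a[i][k] * b[k][j] for k in range(inner))
            out_row.append(alpha * dot + beta * c[i][j])
        result.append(out_row)
    return result


if __name__ == "__main__":
    # Hypothetical 2x2 inputs, for illustration only.
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    c = [[1.0, 1.0], [1.0, 1.0]]
    print(gemm_reference(2.0, a, b, 0.5, c))
```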

Each layer of the Data Analytics Reference Stack has been tested and tuned for performance on Intel Architecture. The impact is clear in the performance gains our stack delivers over non-optimized stacks on data analytics workloads.

Performance gains for the Data Analytics Reference Stack (DARS) using the Intel® Math Kernel Library (Intel® MKL) are as follows:

[Chart: Data Analytics Reference Stack with Intel® MKL, speedup vs. baseline]

Baseline performance shown as 1x.
Configuration details shown below.
Alternating Least Squares (ALS) is a machine learning workload.
Singular Value Decomposition (SVD) is a machine learning workload.
K-means clustering (Kmeans) is a machine learning workload.
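The "1x" baseline in the notes above is a normalization. A minimal sketch of that calculation follows, assuming the charts compare run times (the article does not state the underlying metric). The function name `speedup` and the timings are made up for illustration and are not results from this article.

```python
def speedup(baseline_seconds, optimized_seconds):
    """Speedup relative to the baseline; the baseline itself is 1.0 (1x)."""
    if baseline_seconds <= 0 or optimized_seconds <= 0:
        raise ValueError("run times must be positive")
    return baseline_seconds / optimized_seconds


if __name__ == "__main__":
    # Hypothetical run times in seconds, for illustration only.
    timings = {"workload_a": (120.0, 40.0), "workload_b": (90.0, 60.0)}
    for name, (base, opt) in timings.items():
        print(f"{name}: {speedup(base, opt):.2f}x")
```

A value above 1x means the optimized stack finished faster than the baseline, and a value below 1x means it was slower.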

Performance gains for the Data Analytics Reference Stack (DARS) using OpenBLAS, the open source implementation of the Basic Linear Algebra Subprograms (BLAS), are as follows:

[Chart: Data Analytics Reference Stack with OpenBLAS, speedup vs. baseline]

Baseline performance shown as 1x.
Configuration details shown below.
Singular Value Decomposition (SVD) is a machine learning workload.
K-means clustering (Kmeans) is a machine learning workload.
Generalized Linear Classification Model is a machine learning workload.
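For readers unfamiliar with the workloads named above, the following is a minimal sketch of K-means clustering (Lloyd's algorithm) in plain Python. It is not the Spark implementation used in the benchmarks. The function name `kmeans`, the iteration cap, and the sample points are assumptions made for illustration.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        updated = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                updated.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break  # centroids stopped moving: converged
        centroids = updated
    return centroids, assignments


if __name__ == "__main__":
    # Hypothetical 2-D points and starting centroids, for illustration only.
    pts = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(pts, [(0, 0), (10, 10)]))
```

The starting centroids are passed in explicitly so that the result is deterministic. Production implementations typically choose them with a randomized initialization.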

Detailed server configuration:

Hardware Configuration for Data Nodes

  • Nodes: 1 master + 4 worker nodes
  • CPU sockets/node: 2
  • CPU: Intel® Xeon® Gold 6140
  • Cores / Threads: 18 cores / 36 threads
  • Clock (Base / Turbo): 2.3 GHz / 3.7 GHz
  • L3 Cache: 24.75 MB
  • Memory/Node: 384 GB (12 x 32 GB DDR4 DIMMs, rated at 2400 MHz, operating at 2400 MHz)
  • Storage/Node: 5.6 TB (7 x 800 GB SATA3 SSD)
  • Network: 10 Gbps Ethernet

Detailed software configuration:

| Component | Baseline | Data Analytics Reference Stack (with Intel® MKL) | Data Analytics Reference Stack (with OpenBLAS) |
|---|---|---|---|
| Operating System | CentOS* release 7.6.1810 | Clear Linux 29100 | Clear Linux 29100 |
| Kernel | 4.20.0-1.el7.elrepo.x86_64 | 5.0.2-717 | 5.0.2-717 |
| Java | OpenJDK-1.8_191 | OpenJDK-11.0.2 | OpenJDK-11.0.2 |
| Math Library | F2JBLAS 1.1 | MKL 2018.3.222 | OpenBLAS 0.2.20 |
| Apache Hadoop* | Cloudera* CDH-5.12 | Apache Hadoop* 3.2.0 + patches# | Apache Hadoop* 3.2.0 + patches# |
| Apache Spark* | Apache Spark* 1.6 (from CDH-5.12) | Apache Spark* 2.4.0 + patches# | Apache Spark* 2.4.0 + patches# |
| Scala | 2.11 | 2.12.4 | 2.12.4 |
| Filesystem | HDFS (RF=3) | HDFS (RF=3) | HDFS (RF=3) |

# Additional patches are listed below under System Configuration.
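The hardware and software figures above can be combined into rough cluster totals. The sketch below is a back-of-the-envelope calculation, not a measured result. It assumes that only the four worker nodes contribute compute and storage, that every SSD is dedicated to HDFS, and that the cores and threads listed are per socket. The article states none of these explicitly, and the function name `cluster_totals` is hypothetical.

```python
WORKER_NODES = 4            # from "1 master + 4 worker nodes"
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 18       # Intel Xeon Gold 6140
THREADS_PER_SOCKET = 36
DIMMS_PER_NODE, DIMM_GB = 12, 32
SSDS_PER_NODE, SSD_GB = 7, 800
REPLICATION_FACTOR = 3      # HDFS (RF=3)


def cluster_totals(workers=WORKER_NODES):
    """Derive per-node and cluster-wide totals from the listed configuration."""
    memory_gb_per_node = DIMMS_PER_NODE * DIMM_GB
    storage_tb_per_node = SSDS_PER_NODE * SSD_GB / 1000
    raw_storage_tb = workers * storage_tb_per_node
    return {
        "memory_gb_per_node": memory_gb_per_node,
        "storage_tb_per_node": storage_tb_per_node,
        "worker_cores": workers * SOCKETS_PER_NODE * CORES_PER_SOCKET,
        "worker_threads": workers * SOCKETS_PER_NODE * THREADS_PER_SOCKET,
        "worker_memory_gb": workers * memory_gb_per_node,
        "raw_storage_tb": raw_storage_tb,
        # Assumption: every SSD on every worker is dedicated to HDFS.
        "replicated_capacity_tb": raw_storage_tb / REPLICATION_FACTOR,
    }


if __name__ == "__main__":
    for key, value in cluster_totals().items():
        print(f"{key}: {value}")
```

The per-node results reproduce the 384 GB and 5.6 TB figures in the hardware configuration. With a replication factor of 3, each block is stored three times, so the capacity available for unique data is about one third of the raw total.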

Intel works across the industry to help ensure popular frameworks and topologies run well on Intel Architecture, giving customers a choice in the best solution for their needs. We are using this stack to innovate on our current Intel® Xeon® Scalable processors and plan to continue performance optimizations for coming generations.

We invite developers to contribute feedback and ideas for future versions of the Data Analytics Reference Stack. For more information, please visit our Clear Linux OS Stacks page. To join the Clear Linux community, join our developer mailing list, the #intel-verticals IRC channel, or our GitHub* repository.

1. Source: http://www.cisco.com/c/en/us/solutions/service-provider/vni-network-traffic-forecast/infographic.html, http://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white-paper-c11-741490.pdf

2. Source: https://www.cisco.com/c/dam/m/en_us/service-provider/ciscoknowledgenetwork/files/547_11_10-15-DocumentsCisco_GCI_Deck_2014-2019_for_CKN__10NOV2015_.pdf


Performance results are based on testing as of May 2019 and may not reflect all publicly available security updates. See configuration disclosure for details. No product or component can be absolutely secure.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks. See configuration details above.

System Configuration

Additional software patches: http://clearlinux.org/documentation/clear-linux/tutorials/dars

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Intel, the Intel logo, and Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others. The nominative use of third party logos serves only the purposes of description and identification.