diff --git a/DEVEL.md b/DEVEL.md
deleted file mode 100644
index f2ee20bc3..000000000
--- a/DEVEL.md
+++ /dev/null
@@ -1,901 +0,0 @@
-# Sparkling Water Development Documentation
-
-## Table of Contents
-- [Typical Use Case](#UseCase)
-- [Requirements](#Req)
-- [Design](#Design)
-- [Features](#Features)
- - [Supported Data Sources](#DataSource)
- - [Supported Data Formats](#DataFormat)
- - [Data Sharing](#DataShare)
- - [Provided Primitives](#ProvPrim)
-- [Running on Select Target Platforms](#TargetPlatforms)
- - [Local](#Local)
- - [Standalone](#Standalone)
- - [YARN](#YARN)
- - [Mesos](#Mesos)
-- [H2O Initialization Sequence](#H2OInit)
- - [Configuration](#Config)
- - [Build Environment](#BuildEnv)
- - [Run Environment](#RunEnv)
- - [Sparkling Water Configuration Properties](#Properties)
-- [Running Sparkling Water](#RunSW)
- - [Starting H2O Services](#StartH2O)
- - [Memory Allocation](#MemorySetup)
- - [Security](#Security)
- - [Converting H2OFrame into RDD](#ConvertDF)
- - [Example](#Example)
- - [Converting H2OFrame into DataFrame](#ConvertSchema)
- - [Example](#Example2)
- - [Converting RDD into H2OFrame](#ConvertRDD)
- - [Example](#Example3)
- - [Converting DataFrame into H2OFrame](#ConvertSchematoDF)
- - [Example](#Example4)
- - [Creating H2OFrame from an existing key](#CreateDF)
- - [Calling H2O Algorithms](#CallAlgos)
- - [Running Unit Tests](#UnitTest)
-- [Integration Tests](#IntegTest)
- - [Testing Environment](#TestEnv)
- - [Testing Scenarios](#TestCases)
- - [Integration Tests Example](#IntegExample)
-- [Troubleshooting and Log Locations](#Logging)
-- [Sparkling Shell Console Output](#log4j)
-- [H2O Frame as Spark's Data Source](#DataSource)
- - [Usage in Python - pySparkling](#DataSourcePython)
- - [Usage in Scala](#DataSourceScala)
- - [Specifying Saving Mode](#SavingMode)
-- [Sparkling Water Tuning](#SparklingWaterTuning)
-- [Sparkling Water and Zeppelin](#SparklingWaterZeppelin)
-
----
-
-
-## Typical Use Case
-Sparkling Water excels at leveraging existing Spark-based workflows that need to call advanced machine learning algorithms. A typical example involves data munging with the help of the Spark API, where a prepared table is passed to the H2O DeepLearning algorithm. The constructed DeepLearning model estimates different metrics based on the testing data, which can be used in the rest of the Spark workflow.
-
----
-
-
-## Requirements
- - Linux or Mac OSX platform
- - Java 1.7+
- - [Spark 1.6.0+](http://spark.apache.org/downloads.html)
-
----
-
-
-## Design
-
-Sparkling Water is designed to be executed as a regular Spark application.
-It provides a way to initialize H2O services on each node in the Spark cluster and access
-data stored in data structures of Spark and H2O.
-
-Since Sparkling Water is designed as a Spark application, it is launched
-inside a Spark executor, which is created after application submission.
-At this point, H2O starts its services, including a distributed K/V store and memory manager,
-and orchestrates them into a cloud. The topology of the created cloud matches the topology of the underlying Spark cluster exactly.
-
- 
-
-When H2O services are running, it is possible to create H2O data structures, call H2O algorithms, and transfer values from/to RDD.
-
----
-
-
-## Features
-
-Sparkling Water provides transparent integration for the H2O engine and its machine learning
-algorithms into the Spark platform, enabling:
- * use of H2O algorithms in Spark workflow
- * transformation between H2O and Spark data structures
- * use of Spark RDDs as input for H2O algorithms
- * transparent execution of Sparkling Water applications on top of Spark
-
-
-
-### Supported Data Sources
-Currently, Sparkling Water can use the following data source types:
- - standard RDD API to load data and transform them into `H2OFrame`
- - H2O API to load data directly into `H2OFrame` from:
- - local file(s)
- - HDFS file(s)
- - S3 file(s)
-
----
-
-### Supported Data Formats
-Sparkling Water can read data stored in the following formats:
-
- - CSV
- - SVMLight
- - ARFF
-
-
----
-
-### Data Sharing
-Sparkling Water enables transformation between different types of Spark `RDD` and H2O's `H2OFrame`, and vice versa.
-
- 
-
-When converting from `H2OFrame` to `RDD`, a wrapper is created around the `H2OFrame` to provide an RDD-like API. In this case, no data is duplicated; instead, the data is served directly from the underlying `H2OFrame`.
-
-Converting in the opposite direction (i.e., from a Spark `RDD`/`DataFrame` to `H2OFrame`) requires evaluating the data stored in the Spark `RDD` and transferring it from RDD storage into the `H2OFrame`. However, the data stored in an `H2OFrame` is heavily compressed.
-
-
-----
-
-### Provided Primitives
-Sparkling Water provides the following primitives, which are the basic classes used by Spark components:
-
-
-| Concept | Implementation class | Description |
-|----------------|-----------------------------------|-------------|
-| H2O context | `org.apache.spark.h2o.H2OContext` | H2O context that holds H2O state and provides primitives to transfer RDD into H2OFrame and vice versa. It follows design principles of Spark primitives such as `SparkContext` or `SQLContext` |
-| H2O entry point| `water.H2O` | Represents the entry point for accessing H2O services. It holds information about the actual H2O cluster, including a list of nodes and the status of distributed K/V datastore. |
-| H2O H2OFrame | `water.fvec.H2OFrame` | H2OFrame is the H2O data structure that represents a table of values. The table is column-based and provides column and row accessors. |
-| H2O Algorithms | package `hex` | Represents the H2O machine learning algorithms library, including DeepLearning, GBM, RandomForest. |
-
-
----
-
-
-# Running on Select Target Platforms
-
-Sparkling Water can run on top of Spark in the various ways described in the following sections.
-
-
-## Local
-In this case, Sparkling Water runs as a local cluster (the Spark master variable points to one of the values `local`, `local[*]`, or `local-cluster[...]`).
-
-
-## Standalone
-[Spark documentation - running Standalone cluster](http://spark.apache.org/docs/latest/spark-standalone.html)
-
-
-## YARN
-[Spark documentation - running Spark Application on YARN](http://spark.apache.org/docs/latest/running-on-yarn.html)
-
-When submitting a Sparkling Water application to a CDH or Apache Hadoop cluster, the submit command may look like:
-```
-./spark-submit --master=yarn-client --class water.SparklingWaterDriver --conf "spark.yarn.am.extraJavaOptions=-XX:MaxPermSize=384m"
---driver-memory=8G --num-executors=3 --executor-memory=3G --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=384m"
-sparkling-water-assembly-1.5.11-all.jar
-```
-
-When submitting a Sparkling Water application to an HDP cluster, the submit command may look like:
-```
-./spark-submit --master=yarn-client --class water.SparklingWaterDriver --conf "spark.yarn.am.extraJavaOptions=-XX:MaxPermSize=384m -Dhdp.version=current"
---driver-memory=8G --num-executors=3 --executor-memory=3G --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=384m -Dhdp.version=current"
-sparkling-water-assembly-1.5.11-all.jar
-```
-Apart from the typical Spark configuration, it is necessary to add `-XX:MaxPermSize=384m` (or higher; 384m is the minimum) to both the `spark.executor.extraJavaOptions` and `spark.yarn.am.extraJavaOptions` (in client mode; `spark.driver.extraJavaOptions` in cluster mode) configuration properties in order to run Sparkling Water correctly.
-
-The only difference between an HDP cluster and the CDH and Apache Hadoop clusters is that in the HDP case we also need to add `-Dhdp.version=current` to both the `spark.executor.extraJavaOptions` and `spark.yarn.am.extraJavaOptions` (resp. `spark.driver.extraJavaOptions`) configuration properties.
-
-
-## Mesos
-[Spark documentation - running Spark Application on Mesos](http://spark.apache.org/docs/latest/running-on-mesos.html)
-
-
----
-
-# H2O Initialization Sequence
-If `SparkContext` is available, initialize and start H2O context:
-```scala
-val sc:SparkContext = ...
-val hc = H2OContext.getOrCreate(sc)
-```
-
-The call will:
- 1. Collect the number and host names of the executors (worker nodes) in the Spark cluster
- 2. Launch H2O services on each detected executor
- 3. Create a cloud for H2O services based on the list of executors
- 4. Verify the H2O cloud status
-
-The `getOrCreate` call is preferred because it initializes and starts the H2O context in a single call, and it also returns an already existing `H2OContext` if one was created earlier.
-
----
-
-## Configuration
-
-
-### Build Environment
-The environment must contain the property `SPARK_HOME` that points to the Spark distribution.
-
----
-
-
-### Run Environment
-The environment must contain the property `SPARK_HOME` that points to the Spark distribution.
-
----
-
-
-### Sparkling Water Configuration Properties
-
-All available Sparkling Water configuration properties are listed at [Sparkling Water Properties](doc/configuration_properties.rst).
-
-
-# Running Sparkling Water
-
----
-
-
-### Starting H2O Services
-```scala
-val sc:SparkContext = ...
-val hc = H2OContext.getOrCreate(sc)
-```
-
----
-
-### Memory Allocation
-
-H2O resides in the same executor JVM as Spark. The memory provided for H2O is configured via Spark; refer to [Spark configuration](http://spark.apache.org/docs/1.4.0/configuration.html) for more details.
-
-#### Generic configuration
- * Configure the Executor memory (i.e., memory available for H2O) via the Spark configuration property `spark.executor.memory` .
- > For example, `bin/sparkling-shell --conf spark.executor.memory=5g` or configure the property in `$SPARK_HOME/conf/spark-defaults.conf`
-
- * Configure the Driver memory (i.e., memory available for H2O client running inside Spark driver) via the Spark configuration property `spark.driver.memory`
- > For example, `bin/sparkling-shell --conf spark.driver.memory=4g` or configure the property in `$SPARK_HOME/conf/spark-defaults.conf`
-
-#### Yarn specific configuration
-* Refer to the [Spark documentation](http://spark.apache.org/docs/1.4.0/running-on-yarn.html)
-
-* For JVMs that require a large amount of memory, we strongly recommend configuring the maximum amount of memory available for individual mappers. For information on how to do this using Yarn, refer to http://docs.h2o.ai/deployment/hadoop_yarn.html
-
----
-
-### Security
-
-Both Spark and H2O support basic node authentication and data encryption. In H2O's case, we encrypt all the data sent between server nodes and between client
-and server nodes. This feature does not cover H2O's UDP communication; only data sent via TCP is encrypted.
-
-Currently, only encryption based on Java key pairs is supported (a more in-depth explanation can be found in H2O's documentation linked below).
-
-To enable security for Spark itself, please check [the Spark documentation](http://spark.apache.org/docs/latest/security.html).
-
-Security for data exchanged between H2O instances can be enabled manually by generating all necessary files and distributing them to all worker nodes as
-described in [H2O's documentation](https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/security.rst#ssl-internode-security) and passing the
-`spark.ext.h2o.internal_security_conf` property to spark-submit:
-
-```
-bin/sparkling-shell \
---conf "spark.ext.h2o.internal_security_conf=ssl.properties"
-```
-
-We also provide utility methods which will automatically generate all necessary files and enable security on all H2O nodes:
-
-```scala
-import org.apache.spark.network.Security
-import org.apache.spark.h2o._
-Security.enableSSL(sc) // generate properties file, key pairs and set appropriate H2O parameters
-val hc = H2OContext.getOrCreate(sc) // start the H2O cloud
-```
-
-Or if you plan on passing your own H2OConf then please use:
-
-```scala
-import org.apache.spark.network.Security
-import org.apache.spark.h2o._
-val conf: H2OConf = ... // create your H2OConf
-Security.enableSSL(sc, conf) // generate properties file, key pairs and set appropriate H2O parameters
-val hc = H2OContext.getOrCreate(sc, conf) // start the H2O cloud
-```
-
-This method will generate all necessary files and distribute them via YARN or Spark mechanisms to all worker nodes. This communication is secure if you have configured
-YARN/Spark security.
-
----
-
-### Converting H2OFrame into RDD[T]
-The `H2OContext` class provides the explicit conversion, `asRDD`, which creates an RDD-like wrapper around the provided H2OFrame:
-```scala
-def asRDD[A <: Product: TypeTag: ClassTag](fr : H2OFrame) : RDD[A]
-```
-
-The call expects the type `A` in order to create a correctly typed RDD.
-The conversion requires the type `A` to be bounded by the `Product` trait.
-The relationship between the columns of the H2OFrame and the attributes of class `A` is based on name matching.
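The name matching can be illustrated with a minimal sketch. The `Weather` case class and its column names below are hypothetical assumptions, not part of any shipped dataset; the point is that the field names of `A` must match the frame's column names:

```scala
// Hypothetical case class used as the target type `A` of asRDD[A].
// Column matching is by name: the frame is assumed to contain
// columns "Date", "Temp", and "Precip" for this sketch.
case class Weather(Date: String, Temp: Double, Precip: Double)

// A plain instance, showing the shape of rows asRDD[Weather] would produce.
val row = Weather("2015-01-01", -3.5, 0.0)
```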
-
-
-#### Example
-```scala
-val df: H2OFrame = ...
-val rdd = h2oContext.asRDD[Weather](df)
-```
----
-
-
-### Converting H2OFrame into DataFrame
-The `H2OContext` class provides the explicit conversion, `asDataFrame`, which creates a DataFrame-like wrapper
-around the provided H2OFrame. Technically, it provides the `RDD[sql.Row]` RDD API:
-```scala
-def asDataFrame(fr : H2OFrame)(implicit sqlContext: SQLContext) : DataFrame
-```
-
-This call does not require any type parameters, but because it creates `DataFrame` instances, it requires access to an instance of `SQLContext`. In this case, the instance is provided as an implicit parameter of the call. The parameter can be passed in two ways: as an explicit parameter or by introducing an implicit variable into the current scope.
-
-The schema of the created `DataFrame` instance is derived from the column names and column types of the specified `H2OFrame`.
-
-
-#### Example
-
-Using an explicit parameter in the call to pass sqlContext:
-```scala
-val sqlContext = new SQLContext(sc)
-val schemaRDD = asDataFrame(h2oFrame)(sqlContext)
-```
-or as an implicit variable provided by the current scope:
-```scala
-implicit val sqlContext = new SQLContext(sc)
-val schemaRDD = asDataFrame(h2oFrame)
-```
----
-
-
-### Converting RDD[T] into H2OFrame
-The `H2OContext` provides **implicit** conversion from the specified `RDD[A]` to `H2OFrame`. As with conversion in the opposite direction, the type `A` has to satisfy the upper bound expressed by the type `Product`. The conversion will create a new `H2OFrame`, transfer data from the specified RDD, and save it to the H2O K/V data store.
-
-
-```scala
-implicit def asH2OFrame[A <: Product : TypeTag](rdd : RDD[A]) : H2OFrame
-```
-
-The API also provides an explicit version that allows specifying a name for the resulting
-H2OFrame.
-
-```scala
-def asH2OFrame[A <: Product : TypeTag](rdd : RDD[A], frameName: Option[String]) : H2OFrame
-```
-
-
-#### Example
-```scala
-val rdd: RDD[Weather] = ...
-import h2oContext._
-// implicit call of H2OContext.asH2OFrame[Weather](rdd) is used
-val hf: H2OFrame = rdd
-// Explicit call of the H2OContext API with a name for the resulting H2O frame
-val hfNamed: H2OFrame = h2oContext.asH2OFrame(rdd, Some("h2oframe"))
-```
-
-
----
-
-### Converting DataFrame into H2OFrame
-The `H2OContext` provides **implicit** conversion from the specified `DataFrame` to `H2OFrame`. The conversion will create a new `H2OFrame`, transfer data from the specified `DataFrame`, and save it to the H2O K/V data store.
-
-```scala
-implicit def asH2OFrame(rdd : DataFrame) : H2OFrame
-```
-
-The API also provides an explicit version that allows specifying a name for the resulting
-H2OFrame.
-
-```scala
-def asH2OFrame(rdd : DataFrame, frameName: Option[String]) : H2OFrame
-```
-
-
-#### Example
-```scala
-val df: DataFrame = ...
-import h2oContext._
-// Implicit call of H2OContext.asH2OFrame(df) is used
-val hf: H2OFrame = df
-// Explicit call of the H2OContext API with a name for the resulting H2O frame
-val hfNamed: H2OFrame = h2oContext.asH2OFrame(df, Some("h2oframe"))
-```
----
-
-
-### Creating H2OFrame from an existing Key
-
-If the H2O cluster already contains a loaded `H2OFrame` referenced by the key `train.hex`, it is possible
-to reference it from Sparkling Water by creating a proxy `H2OFrame` instance using the key as the input:
-```scala
-val trainHF = new H2OFrame("train.hex")
-```
-
-### Type mapping between H2O H2OFrame types and Spark DataFrame types
-
-For all primitive Scala types and Spark SQL types (see `org.apache.spark.sql.types`) that can appear in a Spark RDD/DataFrame, we provide a mapping into H2O vector types (numeric, categorical, string, time, UUID; see `water.fvec.Vec`):
-
-| Scala type         | SQL type      | H2O type |
-|--------------------|---------------|----------|
-| _NA_               | BinaryType    | Numeric  |
-| Byte               | ByteType      | Numeric  |
-| Short              | ShortType     | Numeric  |
-| Integer            | IntegerType   | Numeric  |
-| Long               | LongType      | Numeric  |
-| Float              | FloatType     | Numeric  |
-| Double             | DoubleType    | Numeric  |
-| String             | StringType    | String   |
-| Boolean            | BooleanType   | Numeric  |
-| java.sql.Timestamp | TimestampType | Time     |
-
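The table can be restated as a plain Scala map. This is an illustrative sketch only, not the actual Sparkling Water conversion code:

```scala
// Illustrative restatement of the mapping table above:
// Spark SQL type name -> H2O vector type. Not the real implementation.
val sqlToH2OType: Map[String, String] = Map(
  "BinaryType"    -> "Numeric",
  "ByteType"      -> "Numeric",
  "ShortType"     -> "Numeric",
  "IntegerType"   -> "Numeric",
  "LongType"      -> "Numeric",
  "FloatType"     -> "Numeric",
  "DoubleType"    -> "Numeric",
  "StringType"    -> "String",
  "BooleanType"   -> "Numeric",
  "TimestampType" -> "Time"
)
```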
----
-
-### Type mapping between H2O H2OFrame types and RDD\[T\] types
-
-The following types are supported as type `T`:
-
-| T                                              |
-|------------------------------------------------|
-| _NA_                                           |
-| Byte                                           |
-| Short                                          |
-| Integer                                        |
-| Long                                           |
-| Float                                          |
-| Double                                         |
-| String                                         |
-| Boolean                                        |
-| java.sql.Timestamp                             |
-| Any Scala class extending `Product`            |
-| org.apache.spark.mllib.regression.LabeledPoint |
-
-As specified in the table, Sparkling Water supports transforming any Scala class extending `Product`, which includes, for example, all case classes.
-
-
-
----
-
-
-
-### Calling H2O Algorithms
-
- 1. Create the parameters object that holds references to the input data and the algorithm-specific parameters:
-    ```scala
-    val train: H2OFrame = ...
-    val valid: H2OFrame = ...
-
-    val gbmParams = new GBMParameters()
-    gbmParams._train = train
-    gbmParams._valid = valid
-    gbmParams._response_column = 'bikes
-    gbmParams._ntrees = 500
-    gbmParams._max_depth = 6
-    ```
- 2. Create a model builder:
- ```scala
- val gbm = new GBM(gbmParams)
- ```
- 3. Invoke the model build job and block until the end of computation (`trainModel` is an asynchronous call by default):
- ```scala
- val gbmModel = gbm.trainModel.get
- ```
----
-
-## Running Unit Tests
-To invoke tests, the following JVM options are required:
- - `-Dspark.testing=true`
- - `-Dspark.test.home=/path/to/spark/distribution` (the path to your Spark distribution)
-
-
-## Application Development
-You can find Sparkling Water self-contained application skeleton in [Droplet repository](https://github.com/h2oai/h2o-droplets/tree/master/sparkling-water-droplet).
-
-## Sparkling Water configuration
-
- - TODO: used datasources, how data is moved to spark
- - TODO: platform testing - mesos, SIMR
-
-
----
-
-# Integration Tests
-
----
-
-## Testing Environments
- * Local - corresponds to setting Spark `MASTER` variable to one of `local`, or `local[*]`, or `local-cluster[_,_,_]` values
- * Standalone cluster - the `MASTER` variable points to an existing standalone Spark cluster, `spark://...`
- * ad-hoc build cluster
- * CDH5.3 provided cluster
- * YARN cluster - the `MASTER` variable contains the `yarn-client` or `yarn-cluster` value
-
----
-
-
-## Testing Scenarios
- 1. Initialize H2O on top of Spark by running `H2OContext.getOrCreate(sc)` and verifying that H2O was properly initialized on all Spark nodes.
- 2. Load data with help from the H2O API from various data sources:
- * local disk
- * HDFS
- * S3N
- 3. Convert from `RDD[T]` to `H2OFrame`
- 4. Convert from `DataFrame` to `H2OFrame`
- 5. Convert from `H2OFrame` to `RDD`
- 6. Convert from `H2OFrame` to `DataFrame`
- 7. Integrate with H2O Algorithms using RDD as algorithm input
- 8. Integrate with MLlib Algorithms using H2OFrame as algorithm input (KMeans)
- 9. Integrate with MLlib pipelines (TBD)
-
----
-
-
-## Integration Tests Example
-
-The following code reflects the use cases listed above. The code is executed in all testing environments (if applicable):
- * local
- * standalone cluster
- * YARN
-
-Spark 1.6.0+ is required.
-
-1. Initialize H2O:
-
- ```scala
- import org.apache.spark.h2o._
- val sc = new SparkContext(conf)
- val h2oContext = H2OContext.getOrCreate(sc)
- import h2oContext._
- ```
-2. Load data:
- * From the local disk:
-
- ```scala
- val sc = new SparkContext(conf)
- import org.apache.spark.h2o._
- val h2oContext = H2OContext.getOrCreate(sc)
- import java.io.File
- val df: H2OFrame = new H2OFrame(new File("examples/smalldata/allyears2k_headers.csv.gz"))
- ```
- > Note: The file must be present on all nodes.
-
- * From HDFS:
-
- ```scala
- val sc = new SparkContext(conf)
- import org.apache.spark.h2o._
- val h2oContext = H2OContext.getOrCreate(sc)
- val path = "hdfs://mr-0xd6.0xdata.loc/datasets/airlines_all.csv"
- val uri = new java.net.URI(path)
- val airlinesHF = new H2OFrame(uri)
- ```
- * From S3N:
-
- ```scala
- val sc = new SparkContext(conf)
- import org.apache.spark.h2o._
- val h2oContext = H2OContext.getOrCreate(sc)
- val path = "s3n://h2o-airlines-unpacked/allyears2k.csv"
- val uri = new java.net.URI(path)
- val airlinesHF = new H2OFrame(uri)
- ```
- > Spark/H2O needs to know the AWS credentials specified in `core-site.xml`. The credentials are passed via `HADOOP_CONF_DIR` that points to a configuration directory with `core-site.xml`.
-3. Convert from `RDD[T]` to `H2OFrame`:
-
- ```scala
- val sc = new SparkContext(conf)
- import org.apache.spark.h2o._
- val h2oContext = H2OContext.getOrCreate(sc)
- val rdd = sc.parallelize(1 to 1000, 100).map( v => IntHolder(Some(v)))
- val hf: H2OFrame = h2oContext.asH2OFrame(rdd)
- ```
-4. Convert from `DataFrame` to `H2OFrame`:
-
- ```scala
- val sc = new SparkContext(conf)
- import org.apache.spark.h2o._
- val h2oContext = H2OContext.getOrCreate(sc)
- import org.apache.spark.sql._
- val sqlContext = new SQLContext(sc)
- import sqlContext.implicits._
- val df: DataFrame = sc.parallelize(1 to 1000, 100).map(v => IntHolder(Some(v))).toDF
- val hf = h2oContext.asH2OFrame(df)
- ```
-5. Convert from `H2OFrame` to `RDD[T]`:
-
- ```scala
- val sc = new SparkContext(conf)
- import org.apache.spark.h2o._
- val h2oContext = H2OContext.getOrCreate(sc)
- val rdd = sc.parallelize(1 to 1000, 100).map(v => IntHolder(Some(v)))
- val hf: H2OFrame = h2oContext.asH2OFrame(rdd)
- val newRdd = h2oContext.asRDD[IntHolder](hf)
- ```
-6. Convert from `H2OFrame` to `DataFrame`:
-
- ```scala
- val sc = new SparkContext(conf)
- import org.apache.spark.h2o._
- val h2oContext = H2OContext.getOrCreate(sc)
- import org.apache.spark.sql._
- val sqlContext = new SQLContext(sc)
- import sqlContext.implicits._
- val df: DataFrame = sc.parallelize(1 to 1000, 100).map(v => IntHolder(Some(v))).toDF
- val hf = h2oContext.asH2OFrame(df)
- val newRdd = h2oContext.asDataFrame(hf)(sqlContext)
- ```
-7. Integrate with H2O Algorithms using RDD as algorithm input:
-
- ```scala
- val sc = new SparkContext(conf)
- import org.apache.spark.h2o._
- import org.apache.spark.examples.h2o._
- val h2oContext = H2OContext.getOrCreate(sc)
- val path = "examples/smalldata/prostate.csv"
- val prostateText = sc.textFile(path)
- val prostateRDD = prostateText.map(_.split(",")).map(row => ProstateParse(row))
- import _root_.hex.tree.gbm.GBM
- import _root_.hex.tree.gbm.GBMModel.GBMParameters
- import h2oContext._
- val train: H2OFrame = prostateRDD
- val gbmParams = new GBMParameters()
- gbmParams._train = train
- gbmParams._response_column = 'CAPSULE
- gbmParams._ntrees = 10
- val gbmModel = new GBM(gbmParams).trainModel.get
- ```
-8. Integrate with MLlib algorithms:
-
- ```scala
- val sc = new SparkContext(conf)
- import org.apache.spark.h2o._
- import org.apache.spark.examples.h2o._
- import java.io.File
- val h2oContext = H2OContext.getOrCreate(sc)
- val path = "examples/smalldata/prostate.csv"
- val prostateHF = new H2OFrame(new File(path))
- val prostateRDD = h2oContext.asRDD[Prostate](prostateHF)
- import org.apache.spark.mllib.clustering.KMeans
- import org.apache.spark.mllib.linalg.Vectors
- val train = prostateRDD.map( v => Vectors.dense(v.CAPSULE.get*1.0, v.AGE.get*1.0, v.DPROS.get*1.0,v.DCAPS.get*1.0, v.GLEASON.get*1.0))
- val clusters = KMeans.train(train, 5, 20)
- ```
-
----
-
-
-## Troubleshooting and Log Locations
-In the event you hit a bug or find that Sparkling Water is not reacting the way it is supposed to, help us improve the product by sending the
-[H2O.ai team](mailto:support@h2o.ai) the logs. Depending on how you launched H2O, there are a couple of ways to obtain the logs.
-
-
-### Logs for Standalone Sparkling Water
-By default, Spark sets `SPARK_LOG_DIR` to `$SPARK_HOME/work/`, but logging needs to be enabled. When launching Sparkling Shell, run:
-
- ```
- bin/sparkling-shell.sh --conf spark.logConf=true
- ```
-
-Zip up the log files in `$SPARK_HOME/work/`; the directory should contain the assembly jar file and the stdout and stderr of
-each node in the cluster.
-
-
-
-### Logs for Sparkling Water on YARN
-When launching Sparkling Water on YARN, you can find the application ID for the YARN job on the resource manager, where you can also find
-the application master (which is also the Spark master). Then run the following to get the YARN logs:
-
- ```
- yarn logs -applicationId <application ID>
- ```
-
----
-
-
-## Sparkling Shell Console Output
-By default, the console output for Sparkling Shell shows verbose Spark output as well as H2O logs. If you would like to restrict the output to
-only warnings from Spark, you need to change it in the log4j properties file in Spark's configuration directory. To do this:
-
- ```
- cd $SPARK_HOME/conf
- cp log4j.properties.template log4j.properties
- ```
-
-Then, in a text editor (vim, for example), change the contents of the log4j.properties file from:
-
- ```
- #Set everything to be logged to the console
- log4j.rootCategory=INFO, console
- ...
- ```
-
-to:
-
- ```
- #Set everything to be logged to the console
- log4j.rootCategory=WARN, console
- ...
- ```
-
----
-
-
-## H2O Frame as Spark's Data Source
- The way an H2OFrame can be used as a Spark data source differs slightly between Python and Scala.
-
-### Usage in Python - pySparkling
-
-#### Reading from H2O Frame
-Let's suppose we have an H2OFrame `frame`.
-
-There are two ways to load a DataFrame from an H2OFrame in pySparkling:
-
-```
-df = sqlContext.read.format("h2o").option("key",frame.frame_id).load()
-```
-
-or
-
-```
-df = sqlContext.read.format("h2o").load(frame.frame_id)
-```
-
-#### Saving to H2O Frame
-Let's suppose we have a DataFrame `df`.
-
-There are two ways to save a DataFrame as an H2OFrame in pySparkling:
-
-```
-df.write.format("h2o").option("key","new_key").save()
-```
-
-or
-
-```
-df.write.format("h2o").save("new_key")
-```
-
-Both variants save the DataFrame as an H2OFrame with the key "new_key". They won't succeed if an H2OFrame with the same key already exists.
-
-#### Loading & Saving Options
-If the key is specified both as the 'key' option and in the load/save method, the 'key' option is preferred:
-
-```
-df = sqlContext.read.format("h2o").option("key","key_one").load("key_two")
-```
-
-or
-
-```
-df.write.format("h2o").option("key","key_one").save("key_two")
-```
-
-In both examples, "key_one" is used.
-
-
-### Usage in Scala
-
-#### Reading from H2O Frame
-Let's suppose we have an H2OFrame `frame`.
-
-The shortest way to load a DataFrame from an H2OFrame with default settings is:
-
-```
-val df = sqlContext.read.h2o(frame.key)
-```
-
-There are two more ways to load a DataFrame from an H2OFrame that allow specifying additional options:
-
-```
-val df = sqlContext.read.format("h2o").option("key",frame.key.toString).load()
-```
-
-or
-
-```
-val df = sqlContext.read.format("h2o").load(frame.key.toString)
-```
-
-#### Saving to H2O Frame
-Let's suppose we have a DataFrame `df`.
-
-The shortest way to save a DataFrame as an H2OFrame with default settings is:
-
-```
-df.write.h2o("new_key")
-```
-
-There are two more ways to save a DataFrame as an H2OFrame that allow specifying additional options:
-
-```
-df.write.format("h2o").option("key","new_key").save()
-```
-
-or
-
-```
-df.write.format("h2o").save("new_key")
-```
-
-All three variants save the DataFrame as an H2OFrame with the key "new_key". They won't succeed if an H2OFrame with the same key already exists.
-
-#### Loading & Saving Options
-If the key is specified both as the 'key' option and in the load/save method, the 'key' option is preferred:
-
-```
-val df = sqlContext.read.format("h2o").option("key","key_one").load("key_two")
-```
-
-or
-
-```
-df.write.format("h2o").option("key","key_one").save("key_two")
-```
-
-In both examples, "key_one" is used.
-
-
-### Specifying Saving Mode
-There are four save modes available when saving data using the Data Source API; see http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
-
-If "append" mode is used, an existing H2OFrame with the same key is deleted and new one containing union of
-all rows from original H2O Frame and appended Data Frame is created with the same key.
-
-If "overwrite" mode is used, an existing H2OFrame with the same key is deleted and new one with the new rows is created with the same key.
-
-If "error" mode is used and a H2OFrame with the specified key already exists, exception is thrown.
-
-if "ignore" mode is used and a H2OFrame with the specified key already exists, no data are changed.
-
-
-## Sparkling Water Tuning
-For running Sparkling Water, the general recommendations are:
- - increase the memory available to the driver and executors (options `spark.driver.memory`, resp. `spark.yarn.am.memory`, and `spark.executor.memory`),
- - make the cluster homogeneous - use the same value for driver and executor memory,
- - increase the PermGen size if you are running on top of Java 7 (options `spark.driver.extraJavaOptions`, resp. `spark.yarn.am.extraJavaOptions`, and `spark.executor.extraJavaOptions`),
- - in rare cases, it helps to increase `spark.yarn.driver.memoryOverhead`, `spark.yarn.am.memoryOverhead`, or `spark.yarn.executor.memoryOverhead`.
-
-For running Sparkling Water on top of YARN:
- - make sure that YARN provides stable containers; do not use a preemptive YARN scheduler,
- - make sure that the Spark application master has enough memory, and increase its PermGen size,
- - in case of a container failure, YARN should not restart the container, and the application should terminate gracefully.
-
-
-Furthermore, we recommend configuring the following Spark properties to speed up and stabilize the creation of H2O services on top of a Spark cluster:
-
-| Property | Context | Value | Explanation |
-|----------|---------|-------|-------------|
-| `spark.locality.wait` | all | `3000` | Number of milliseconds to wait for a task launch on a data-local node. We recommend increasing it to make sure that H2O tasks are processed locally with their data.|
-| `spark.scheduler.minRegisteredResourcesRatio` | all| `1` | Make sure that Spark starts scheduling when it sees 100% of resources. |
-| `spark.task.maxFailures` | all | `1`| Do not try to retry failed tasks. |
-| `spark...extraJavaOptions` | all| `-XX:MaxPermSize=384m` | Increase PermGen size if you are running on Java7. Make sure to configure it on driver/executor/Yarn application manager. |
-| `spark.yarn.....memoryOverhead` | yarn | increase | Increase memoryOverhead if it is necessary. |
-| `spark.yarn.max.executor.failures` | yarn | `1` | Do not try to restart executors after a failure; fail the computation directly. |
-| `spark.executor.heartbeatInterval` | all | `10s` | Interval between each executor's heartbeats to the driver. This property should be significantly less than spark.network.timeout. |
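Taken together, the recommendations above correspond to a `spark-defaults.conf` fragment along the following lines (the values are the suggestions from the table; this is a sketch to adapt, not a definitive configuration):

```
spark.locality.wait                           3000
spark.scheduler.minRegisteredResourcesRatio   1
spark.task.maxFailures                        1
spark.executor.extraJavaOptions               -XX:MaxPermSize=384m
spark.yarn.max.executor.failures              1
spark.executor.heartbeatInterval              10s
```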
-
-
-## Sparkling Water and Zeppelin
-Since Sparkling Water exposes a Scala API, it is possible to access it directly from a Zeppelin notebook cell marked by the `%spark` tag.
-
-### Launch Zeppelin with Sparkling Water
-Using Sparkling Water from Zeppelin is easy, since Sparkling Water is distributed as a Spark package.
-In this case, additional shell variables are needed before launching Zeppelin:
-
-```bash
-export SPARK_HOME=...# Spark 2.0 home
-export SPARK_SUBMIT_OPTIONS="--packages ai.h2o:sparkling-water-examples_2.11:2.0.0"
-bin/zeppelin.sh -Pspark-2.0
-```
-
-The command above uses Spark 2.0 and the corresponding Sparkling Water package.
-
-### Using Zeppelin
-The use of the Sparkling Water package is driven directly by the Sparkling Water API. For example, getting `H2OContext` is straightforward:
-
-```scala
-%spark
-import org.apache.spark.h2o._
-val hc = H2OContext.getOrCreate(sc)
-```
-
-Creating `H2OFrame` from Spark `DataFrame`:
-```scala
-%spark
-val df = sc.parallelize(1 to 1000).toDF
-val hf = hc.asH2OFrame(df)
-```
-
-Creating Spark `DataFrame` from `H2OFrame`:
-```scala
-%spark
-val df = hc.asDataFrame(hf)
-```
-
diff --git a/README.rst b/README.rst
index dc5fdcd79..3188955a6 100644
--- a/README.rst
+++ b/README.rst
@@ -31,6 +31,9 @@ Please, switch to the right branch:
branch includes the latest changes for the latest Spark version.
They are back-ported into older Sparkling Water versions.
+.. The Requirements section is copied from doc/requirements.rst as github does not support include directive of
+.. reStructuredText
+
Requirements
~~~~~~~~~~~~
@@ -168,6 +171,8 @@ List of all Frequently Asked Questions is available at `FAQ `__.
Development
-----------
+Complete development documentation is available at `Development Documentation `__.
+
Build Sparkling Water
~~~~~~~~~~~~~~~~~~~~~
diff --git a/apps/streaming/README.md b/apps/streaming/README.md
deleted file mode 100644
index 9e03b9017..000000000
--- a/apps/streaming/README.md
+++ /dev/null
@@ -1,58 +0,0 @@
-# Real Time Pipeline To H2OFrames Using Sparkling Water
-
-This demo will show you how to take streaming data and create a "live" dataframe over some rolling time window. The potential use cases could be using H2O data munging capabilities on a real time distributed dataframe or mini batch training for online ML using H2O [checkpointing models](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/index.html?#%E2%80%A6%20Building%20Models-Viewing%20Models-Checkpointing%20Models).
-
-## Requirements
-
-sbt
-Spark 2.0
-Sparkling Water 2.0.x
-Python
-
-## Directions
- 1. Build the project `./gradlew clean jar`
-
- 2. Run the demo class `ai.h2o.PipelineDemo`
- ```bash
- $SPARK_HOME/bin/spark-submit \
- --class ai.h2o.PipelineDemo \
- --master 'local[*]' \
- --driver-memory 2G \
- --packages ai.h2o:sparkling-water-core_2.11:2.0.0 \
- --conf spark.scheduler.minRegisteredResourcesRatio=1 \
- --conf spark.ext.h2o.repl.enabled=False \
- ./build/libs/*.jar \
- 9999
- ```
-
- 3. You should see some errors `Error connecting to localhost:9999` as Spark Streaming starts to run but this can be fixed by connecting the stream source with:
- ```bash
- python socket_send_spark.py
- ```
-
- 4. Then you should be able to login in to Flow/R/Python and see the live H2OFrame
-
-## Caveats
-
-When Sparkling Water converts the RDD to an H2OFrame, you will not be able to access it during
-this time. It's good to test and find the best periodicity in generating your H2OFrame. Also, it
-is preferable that all the steps for your use case are done in the events.window loop:
-
-```scala
- events.window(Seconds(300), Seconds(10)).foreachRDD(rdd =>
- {
- if (!rdd.isEmpty ) {
- try {
- hf.delete()
- } catch { case e: Exception => println("Initialized frame") }
- hf = hc.asH2OFrame(rdd.toDF(), "demoFrame10s") //the name of the frame
- // perform your munging, score your events with a pretrained model or
- // mini-batch training on checkpointed models, etc
- // make sure your execution finishes within the batch cycle (the
- // second arg in the window)
- }
- }
- )
-```
-
-
diff --git a/apps/streaming/README.rst b/apps/streaming/README.rst
new file mode 100644
index 000000000..e98f8692f
--- /dev/null
+++ b/apps/streaming/README.rst
@@ -0,0 +1,75 @@
+Real Time Pipeline To H2OFrames Using Sparkling Water
+=====================================================
+
+This demo shows you how to take streaming data and create a "live"
+dataframe over some rolling time window. The potential use cases could
+be using H2O data munging capabilities on a real time distributed
+dataframe or mini batch training for online ML using H2O.
+
+Requirements
+------------
+
+- sbt
+- Spark 2.+
+- Sparkling Water 2.+
+- Python 2.6+
+
+Directions
+----------
+
+1. Build the project:
+
+.. code:: bash
+
+ ./gradlew build -x check
+
+2. Run the demo class ``ai.h2o.PipelineDemo``:
+
+.. code:: bash
+
+ $SPARK_HOME/bin/spark-submit \
+ --class ai.h2o.PipelineDemo \
+ --master 'local[*]' \
+ --driver-memory 2G \
+ --packages ai.h2o:sparkling-water-core_2.11:2.1.0 \
+ --conf spark.scheduler.minRegisteredResourcesRatio=1 \
+ --conf spark.ext.h2o.repl.enabled=False \
+ ./build/libs/*.jar 9999
+
+3. You may see some errors such as ``Error connecting to localhost:9999``
+   as Spark Streaming starts to run; this is fixed by connecting the
+   stream source with:
+
+.. code:: bash
+
+ python socket_send_spark.py
+
+4. Then you should be able to log in to Flow/R/Python and see the live
+   ``H2OFrame``.
+
+Caveats
+-------
+
+While Sparkling Water converts the ``RDD`` to an ``H2OFrame``, you will
+not be able to access the frame. It is good to test and find the best
+periodicity for generating your ``H2OFrame``. Also, it is preferable that
+all the steps for your use case are done in the ``events.window`` loop:
+
+.. code:: scala
+
+ events.window(Seconds(300), Seconds(10)).foreachRDD(rdd =>
+ {
+      if (!rdd.isEmpty) {
+ try {
+ hf.delete()
+ } catch {
+ case e: Exception => println("Initialized frame")
+ }
+ hf = hc.asH2OFrame(rdd.toDF(), "demoFrame10s") //the name of the frame
+ // perform your munging, score your events with a pretrained model or
+ // mini-batch training on checkpointed models, etc
+ // make sure your execution finishes within the batch cycle (the
+ // second arg in the window)
+ }
+ }
+ )
diff --git a/design-doc/images/Sparkling Water cluster.png b/design-doc/images/Sparkling Water cluster.png
deleted file mode 100644
index 160fb8514..000000000
Binary files a/design-doc/images/Sparkling Water cluster.png and /dev/null differ
diff --git a/design-doc/images/Topology.png b/design-doc/images/Topology.png
deleted file mode 100644
index 0399a7a38..000000000
Binary files a/design-doc/images/Topology.png and /dev/null differ
diff --git a/design-doc/images/src/sparkling_cluster_diagram.pptx b/design-doc/images/src/sparkling_cluster_diagram.pptx
deleted file mode 100644
index 5d157559b..000000000
Binary files a/design-doc/images/src/sparkling_cluster_diagram.pptx and /dev/null differ
diff --git a/design-doc/pysparkling.md b/design-doc/pysparkling.md
deleted file mode 100644
index cd436bd31..000000000
--- a/design-doc/pysparkling.md
+++ /dev/null
@@ -1,107 +0,0 @@
-# pySparkling
-
-## Goal
-Provide transparent user experience of using Sparkling Water from Python.
-It includes:
- - support creation of H2OContext
- - support data transfers - from H2OFrame to DataFrame/RDD and back
- - transparent use of H2OFrames and RDDs
-
-## Usage
-
-Command to launch pyspark with Sparkling Water:
- ```
-PYSPARK_PYTHON=ipython $SPARK_HOME/bin/pyspark --packages ai.h2o:sparkling-water-core_2.10:1.5.2,ai.h2o:sparkling-water-examples_2.10:1.5.2
-```
-
-Command to launch Python script `script.py` via `spark-submit`:
-```
-$SPARK_HOME/bin/spark-submit --packages ai.h2o:sparkling-water-core_2.10:1.5.2,ai.h2o:sparkling-water-examples_2.10:1.5.2 script.py
-```
-
-Creating H2O Context:
-```
-hc = H2OContext(sc)
-```
-
-
-Testing - in sparkling-water/py directory:
-```
-import sys
-sys.path.append(".")
-import pysparkling.context
-hc = pysparkling.context.H2OContext(sc)
-hc.start()
-```
-
-## Technical details
-
-Use Py4J bundled with PySpark to access JVM classes
-
-```
-from py4j.java_gateway import java_import
-java_import(sc._jvm, "org.apache.spark.h2o.*")
-jvm = sc._jvm
-gw = sc._gateway
-hc_klazz = jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass("org.apache.spark.h2o.H2OContext")
-ctor_def = gw.new_array(jvm.Class, 1)
-ctor_def[0] = sc._jsc.getClass()
-hc_ctor = hc_klazz.getConstructor(ctor_def)
-ctor_params = gw.new_array(jvm.Object, 1)
-ctor_params[0] = sc._jsc
-hc = hc_ctor.newInstance(ctor_params)
-hc.start()
-
-```
-
-# Running and debugging
-
-## From Python notebook
-Start script:
-
-```
-#!/usr/bin/env bash
-export PYTHONPATH=$H2O_HOME/h2o-py:$SPARKLING_HOME/py:$PYTHONPATH
-export SPARK_CLASSPATH=$SPARK_CLASSPATH:$SPARKLING_HOME/assembly/build/libs/sparkling-water-assembly-0.2.17-SNAPSHOT-all.jar
-echo $SPARK_CLASSPATH
-IPYTHON_OPTS="notebook" $SPARK_HOME/bin/pyspark
-```
-
-where:
- - `H2O_HOME` points to H2O project which contains right python package
- - `SPARKLING_HOME` points to Sparkling project git repository
- - before run, Sparkling Water has to be build: `./gradlew build -x test`
-
-# Testing
-
-## Development testing
-```
- export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:./:$PYTHONPATH
- export SPARK_CLASSPATH=./sparkling-water/assembly/build/libs/sparkling-water-assembly-0.2.17-SNAPSHOT-all.jar
- python -m unittest pysparkling/tests.py
-```
-## Testing from Gradle
-Needs:
- - SPARK_HOME
- - H2O_HOME (dev, downloaded) - you can test against different H2O versions
- - python
- - python packages for H2O
-
-## Notes
-For Python applications, simply pass a .py file in place of a JAR, and add Python .zip, .egg or .py files to the search path with `--py-files`.
-
-## Problems
-
-Spark Issues
- - https://issues.apache.org/jira/browse/SPARK-5185 - `--jars` packages are not appended to driver path
- * Solution: `sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass("org.apache.spark.h2o.H2OContext").newInstance()`
- * ```
- We've been setting SPARK_SUBMIT_CLASSPATH as a workaround to this issue, but as of https://github.com/apache/spark/commit/517975d89d40a77c7186f488547eed11f79c1e97 this variable no longer exists. We're now setting SPARK_CLASSPATH as a workaround.
- ```
-
- Related issue:
- - https://issues.apache.org/jira/browse/SPARK-6047
-
-
- # Neat
- - https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/DynamicClassLoader.java
diff --git a/design-doc/rest-api-scala-endpoints.md b/design-doc/rest-api-scala-endpoints.md
deleted file mode 100644
index deaf736a6..000000000
--- a/design-doc/rest-api-scala-endpoints.md
+++ /dev/null
@@ -1,46 +0,0 @@
-# Scala interpreter available through REST API
-
-## Basic design
-There is a simple pool of interpreters created at the start. Once a Scala interpreter is associated with
-a session, a new interpreter is added to the pool. Each interpreter is deleted when it is not used for some
-fixed time.
-
-## Example usage of scala interpreter using REST API
-Here, you can find some basic calls to access the Scala interpreter behind the REST API using curl.
-
-### init interpreter and obtain session ID
-curl --data-urlencode '' http://192.168.0.10:54321/3/scalaint
-
-### destroy session and interpreter associated with the session
-curl -X DELETE http://192.168.0.10:54321/3/scalaint/512ef484-e21a-48f9-979e-2879f63a779e
-
-### get all active sessions
-curl http://192.168.0.10:54321/3/scalaint
-
-### try to interpret the incomplete code, status is error
-curl --data-urlencode code='sc.' http://192.168.0.10:54321/3/scalaint/c3e5ea38-0b7e-4136-9ba3-21615ea2d298
-
-### try to interpret the code, status is error (function does not exist)
-curl --data-urlencode code='foo()' http://192.168.0.10:54321/3/scalaint/c3e5ea38-0b7e-4136-9ba3-21615ea2d298
-
-### try to interpret the code, result is success
-curl --data-urlencode code='21+21' http://192.168.0.10:54321/3/scalaint/c3e5ea38-0b7e-4136-9ba3-21615ea2d298
-
-### try to interpret the code with the spark context, use semicolon to separate commands
-curl --data-urlencode code='val data = Array(1, 2, 3, 4, 5); val distData = sc.parallelize(data); val result = distData.map(s => s+10)' http://192.168.0.10:54321/3/scalaint/c3e5ea38-0b7e-4136-9ba3-21615ea2d298
-
-### try to interpret the code with the spark context, use new lines to separate commands
-curl --data-urlencode code='
-val data = Array(1, 2, 3, 4, 5)
-val distData = sc.parallelize(data)
-val result = distData.map(s => s+10)' http://192.168.0.10:54321/3/scalaint/c3e5ea38-0b7e-4136-9ba3-21615ea2d298
-
-### declare class and use it in the next call
-curl --data-urlencode code='
-case class A(number: Int)' http://192.168.0.10:54321/3/scalaint/c3e5ea38-0b7e-4136-9ba3-21615ea2d298
-
-curl --data-urlencode code='
-val data = Array(1, 2, 3, 4, 5)
-val distData = sc.parallelize(data)
-val result = distData.map(s => A(s))' http://192.168.0.10:54321/3/scalaint/c3e5ea38-0b7e-4136-9ba3-21615ea2d298
-
diff --git a/doc/DEVEL.rst b/doc/DEVEL.rst
new file mode 100644
index 000000000..701dae0cd
--- /dev/null
+++ b/doc/DEVEL.rst
@@ -0,0 +1,56 @@
+Sparkling Water Development Documentation
+=========================================
+
+This page acts only as a table of contents and provides links to the
+actual documentation pages.
+
+- `Typical Use Case `__
+- `Requirements `__
+- `Design `__
+
+ - `Basic Primitives `__
+ - `Supported Platforms `__
+ - `Spark Frame - H2O Frame Mapping `__
+ - `Data Sharing Between H2O and Spark `__
+ - `Supported Data Sources `__
+ - `Supported Data Formats `__
+
+- `Sparkling Water Configuration `__
+
+ - `Sparkling Water Configuration Properties `__
+ - `Memory Allocation `__
+ - `Sparkling Water Internal Backend Tuning `__
+ - `Sparkling Water Backends Configuration `__
+
+- `Tutorials `__
+
+ - `Running Sparkling Water `__
+ - `Running Sparkling Water on Windows `__
+
+ - `Sparkling Water Backends Run & Configuration `__
+ - `Enabling Security `__
+ - `Calling H2O Algorithms `__
+ - Frames Conversions & Creation
+
+ - `Spark - H2O Frame Conversions `__
+ - `H2O Frame as Spark's Data Source `__
+ - `Create H2OFrame From an Existing Key `__
+
+ - Logging
+
+ - `Change Sparkling Shell Logging Level `__
+ - `Obtain Sparkling Water Logs `__
+
+ - `Sparkling Water and Zeppelin `__
+ - `Use Sparkling Water as Spark Package `__
+
+- `Development `__
+
+ - `Building `__
+ - `Running Sparkling Water Examples `__
+ - `Running Unit Tests `__
+ - `Integration Tests `__
+
+- `Sparkling Water REST API `__
+
+ - `Scala Interpreter REST API `__
diff --git a/doc/configuration/configuration.rst b/doc/configuration/configuration.rst
new file mode 100644
index 000000000..1a0f1e9df
--- /dev/null
+++ b/doc/configuration/configuration.rst
@@ -0,0 +1,7 @@
+Configuration
+-------------
+
+- `Sparkling Water Configuration Properties `__
+- `Memory Allocation `__
+- `Sparkling Water Internal Backend Tuning `__
+- `Sparkling Water Backends Configuration <../tutorials/backends.rst>`__
diff --git a/doc/configuration_properties.rst b/doc/configuration/configuration_properties.rst
similarity index 100%
rename from doc/configuration_properties.rst
rename to doc/configuration/configuration_properties.rst
diff --git a/doc/configuration/internal_backend_tuning.rst b/doc/configuration/internal_backend_tuning.rst
new file mode 100644
index 000000000..f884fcee5
--- /dev/null
+++ b/doc/configuration/internal_backend_tuning.rst
@@ -0,0 +1,81 @@
+Sparkling Water Tuning
+----------------------
+
+For running Sparkling Water, the general recommendations are:
+
+- Increase the available memory in the driver and executors (options ``spark.driver.memory``, or ``spark.yarn.am.memory`` on YARN, and ``spark.executor.memory``).
+- Make the cluster homogeneous - use the same value for driver and executor memory.
+- Increase the PermGen size if you are running on top of Java 7 (options ``spark.driver.extraJavaOptions``, or ``spark.yarn.am.extraJavaOptions`` on YARN, and ``spark.executor.extraJavaOptions``).
+- In rare cases, it helps to increase ``spark.yarn.driver.memoryOverhead``, ``spark.yarn.am.memoryOverhead``, or ``spark.yarn.executor.memoryOverhead``.
+
+For running Sparkling Water on top of Yarn:
+
+- Make sure that Yarn provides stable containers; do not use the preemptive Yarn scheduler.
+- Make sure that the Spark application master has enough memory and increase the PermGen size.
+- In case of a container failure, Yarn should not restart the container, and the application should gracefully terminate.
+
+Furthermore, we recommend configuring the following Spark properties to
+speed up and stabilize the creation of H2O services on top of a Spark cluster:
+
++-------------------------------------------------+--------------------------+----------------------------+
+| Property                                        | Value                    | Explanation                |
++=================================================+==========================+============================+
+| **All environments (YARN/Standalone/Local)**    |                          |                            |
++-------------------------------------------------+--------------------------+----------------------------+
+| ``spark.locality.wait``                         | ``3000``                 | Number of milliseconds to  |
+|                                                 |                          | wait for launching a task  |
+|                                                 |                          | on a data-local node. We   |
+|                                                 |                          | recommend increasing it    |
+|                                                 |                          | to make sure that H2O      |
+|                                                 |                          | tasks are processed        |
+|                                                 |                          | locally with their data.   |
++-------------------------------------------------+--------------------------+----------------------------+
+| ``spark.scheduler.minRegisteredResourcesRatio`` | ``1``                    | Make sure that Spark       |
+|                                                 |                          | starts scheduling when it  |
+|                                                 |                          | sees 100% of resources.    |
++-------------------------------------------------+--------------------------+----------------------------+
+| ``spark.task.maxFailures``                      | ``1``                    | Do not try to retry        |
+|                                                 |                          | failed tasks.              |
++-------------------------------------------------+--------------------------+----------------------------+
+| ``spark.driver.extraJavaOptions``               | ``-XX:MaxPermSize=384m`` | Increase PermGen if you    |
+|                                                 |                          | are running in Java7 on    |
+|                                                 |                          | the Spark driver.          |
++-------------------------------------------------+--------------------------+----------------------------+
+| ``spark.executor.extraJavaOptions``             | ``-XX:MaxPermSize=384m`` | Increase PermGen if you    |
+|                                                 |                          | are running in Java7 on    |
+|                                                 |                          | the Spark executor.        |
++-------------------------------------------------+--------------------------+----------------------------+
+| ``spark.executor.heartbeatInterval``            | ``10s``                  | Interval between each      |
+|                                                 |                          | executor's heartbeats to   |
+|                                                 |                          | the driver. This property  |
+|                                                 |                          | should be significantly    |
+|                                                 |                          | less than                  |
+|                                                 |                          | ``spark.network.timeout``. |
++-------------------------------------------------+--------------------------+----------------------------+
+| **YARN environment**                            |                          |                            |
++-------------------------------------------------+--------------------------+----------------------------+
+| ``spark.yarn.am.extraJavaOptions``              | ``-XX:MaxPermSize=384m`` | Increase PermGen if you    |
+|                                                 |                          | are running in Java7 on    |
+|                                                 |                          | the Yarn application       |
+|                                                 |                          | master.                    |
++-------------------------------------------------+--------------------------+----------------------------+
+| ``spark.yarn.driver.memoryOverhead``            | increase                 | Increase memory overhead   |
+|                                                 |                          | if necessary for the       |
+|                                                 |                          | container running the      |
+|                                                 |                          | driver.                    |
++-------------------------------------------------+--------------------------+----------------------------+
+| ``spark.yarn.executor.memoryOverhead``          | increase                 | Increase memory overhead   |
+|                                                 |                          | if necessary for the       |
+|                                                 |                          | containers running the     |
+|                                                 |                          | executors.                 |
++-------------------------------------------------+--------------------------+----------------------------+
+| ``spark.yarn.am.memoryOverhead``                | increase                 | Increase memory overhead   |
+|                                                 |                          | if necessary for the Yarn  |
+|                                                 |                          | application master.        |
++-------------------------------------------------+--------------------------+----------------------------+
+| ``spark.yarn.max.executor.failures``            | ``1``                    | Do not try to restart      |
+|                                                 |                          | executors after failure    |
+|                                                 |                          | and directly fail the      |
+|                                                 |                          | computation.               |
++-------------------------------------------------+--------------------------+----------------------------+
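+
+As an illustration only (not part of the original recommendations), a
+``spark-submit`` invocation applying the properties above might look like
+the following sketch; the assembly jar name is hypothetical and values
+should be adjusted to your cluster:
+
+.. code:: bash
+
+    $SPARK_HOME/bin/spark-submit \
+      --master yarn-client \
+      --conf spark.locality.wait=3000 \
+      --conf spark.scheduler.minRegisteredResourcesRatio=1 \
+      --conf spark.task.maxFailures=1 \
+      --conf spark.yarn.max.executor.failures=1 \
+      --conf spark.executor.heartbeatInterval=10s \
+      --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=384m" \
+      --conf "spark.yarn.am.extraJavaOptions=-XX:MaxPermSize=384m" \
+      sparkling-water-assembly-all.jar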
diff --git a/doc/configuration/memory_setup.rst b/doc/configuration/memory_setup.rst
new file mode 100644
index 000000000..df2024c4b
--- /dev/null
+++ b/doc/configuration/memory_setup.rst
@@ -0,0 +1,34 @@
+Memory Allocation
+-----------------
+
+H2O resides in the same executor JVM as Spark. The memory provided for
+H2O is configured via Spark; refer to `Spark
+configuration `__
+for more details.
+
+Generic configuration
+~~~~~~~~~~~~~~~~~~~~~
+
+- Configure the executor memory (i.e., memory available for H2O) via
+  the Spark configuration property ``spark.executor.memory``.
+
+  For example, ``bin/sparkling-shell --conf spark.executor.memory=5g``, or
+  configure the property in ``$SPARK_HOME/conf/spark-defaults.conf``.
+
+- Configure the driver memory (i.e., memory available for the H2O client
+  running inside the Spark driver) via the Spark configuration property
+  ``spark.driver.memory``.
+
+  For example, ``bin/sparkling-shell --conf spark.driver.memory=4g``, or
+  configure the property in ``$SPARK_HOME/conf/spark-defaults.conf``.
+
+Yarn-specific configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- Refer to the `Spark
+ documentation `__
+
+- For JVMs that require a large amount of memory, we strongly recommend
+ configuring the maximum amount of memory available for individual
+ mappers. For information on how to do this using Yarn, refer to
+ http://docs.h2o.ai/h2o/latest-stable/h2o-docs/faq/hadoop.html
diff --git a/doc/design/basic_primitives.rst b/doc/design/basic_primitives.rst
new file mode 100644
index 000000000..d7559a39a
--- /dev/null
+++ b/doc/design/basic_primitives.rst
@@ -0,0 +1,34 @@
+Provided Primitives
+-------------------
+
+Sparkling Water provides the following primitives, which are the basic
+classes used by Spark components:
+
++-------------------+--------------------------------------+--------------------------------------+
+| Concept           | Implementation class                 | Description                          |
++===================+======================================+======================================+
+| H2O context       | ``org.apache.spark.h2o.H2OContext``  | H2O Context that holds state and     |
+|                   |                                      | provides primitives to transfer      |
+|                   |                                      | RDD/DataFrames/Datasets into         |
+|                   |                                      | H2OFrame and vice versa. It follows  |
+|                   |                                      | design principles of Spark           |
+|                   |                                      | primitives such as ``SparkSession``, |
+|                   |                                      | ``SparkContext`` and ``SQLContext``. |
++-------------------+--------------------------------------+--------------------------------------+
+| H2O entry point   | ``water.H2O``                        | Represents the entry point           |
+|                   |                                      | for accessing H2O services. It holds |
+|                   |                                      | information about the actual H2O     |
+|                   |                                      | cluster, including a list of nodes   |
+|                   |                                      | and the status of distributed K/V    |
+|                   |                                      | datastore.                           |
++-------------------+--------------------------------------+--------------------------------------+
+| H2O H2OFrame      | ``water.fvec.H2OFrame``              | H2OFrame is the H2O data structure   |
+|                   |                                      | that represents a table of values.   |
+|                   |                                      | The table is column-based and        |
+|                   |                                      | provides column and row accessors.   |
++-------------------+--------------------------------------+--------------------------------------+
+| H2O Algorithms    | package ``hex``                      | Represents the H2O machine learning  |
+|                   |                                      | algorithms library, including, for   |
+|                   |                                      | example, DeepLearning, GBM or        |
+|                   |                                      | RandomForest.                        |
++-------------------+--------------------------------------+--------------------------------------+
diff --git a/doc/design/data_sharing.rst b/doc/design/data_sharing.rst
new file mode 100644
index 000000000..f8f947bd3
--- /dev/null
+++ b/doc/design/data_sharing.rst
@@ -0,0 +1,33 @@
+Data Sharing
+------------
+
+Sparkling Water enables transformation between different types of Spark
+``RDD`` and H2O's ``H2OFrame``, and vice versa.
+
+Conversion Design
+~~~~~~~~~~~~~~~~~
+
+When converting from ``H2OFrame`` to ``RDD``, a wrapper is created
+around the ``H2OFrame`` to provide an RDD-like API. In this case, no
+data is duplicated; instead, the data is served directly from the
+underlying ``H2OFrame``.
+
+Converting in the opposite direction (i.e., from a Spark
+``RDD``/``DataFrame`` to an ``H2OFrame``) requires evaluating the data
+stored in the Spark ``RDD`` and transferring it from RDD storage into
+the ``H2OFrame``. However, data stored in an ``H2OFrame`` is heavily
+compressed.
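+
+As a minimal, illustrative sketch (it assumes an already running
+``H2OContext`` named ``hc`` and a Spark ``DataFrame`` named ``df``),
+both conversion directions look like:
+
+.. code:: scala
+
+    import org.apache.spark.h2o._
+
+    // DataFrame -> H2OFrame: data is evaluated and copied into
+    // H2O's compressed, column-based store
+    val hf: H2OFrame = hc.asH2OFrame(df)
+
+    // H2OFrame -> DataFrame: a wrapper is created around the
+    // H2OFrame; no data is duplicated
+    val back = hc.asDataFrame(hf)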
+
+Exchanging the Data
+~~~~~~~~~~~~~~~~~~~
+
+The way data is transferred between Spark and H2O depends on the
+Sparkling Water backend in use.
+
+In the Internal Sparkling Water Backend, Spark and H2O share the same
+JVM, as depicted in the following figure. |Data Sharing|
+
+In the External Sparkling Water Backend, Spark and H2O are separate
+clusters, and data has to be sent between these clusters over the network.
+
+.. |Data Sharing| image:: ../images/internal_backend_data_sharing.png
+
diff --git a/doc/design/design.rst b/doc/design/design.rst
new file mode 100644
index 000000000..0416210d3
--- /dev/null
+++ b/doc/design/design.rst
@@ -0,0 +1,57 @@
+Design
+------
+
+Basic Introduction
+~~~~~~~~~~~~~~~~~~
+
+Sparkling Water is designed to be executed as a regular Spark
+application. It provides a way to initialize H2O services on each node
+in the Spark cluster and access data stored in data structures of Spark
+and H2O.
+
+Sparkling Water provides transparent integration for the H2O engine
+and its machine learning algorithms into the Spark platform, enabling:
+
+- Use of H2O algorithms in Spark workflow.
+- Transformation between H2O and Spark data structures.
+- Use of Spark RDDs as input for H2O algorithms.
+- Transparent execution of Sparkling Water applications on top of Spark.
+
+Sparkling Water supports two types of backends. In the internal backend,
+Sparkling Water is launched inside a Spark executor, which is created
+after application submission. At this point, H2O starts services,
+including distributed KV store and memory manager, and orchestrates them
+into a cloud. The topology of the created cloud matches the topology of
+the underlying Spark cluster exactly. The following figure represents the Internal
+Sparkling Water cluster.
+
+.. figure:: ../images/internal_backend.png
+ :alt: Internal Sparkling Water Cluster Topology
+
+In the external backend, the H2O cluster is started separately, and we
+connect to it from the Spark driver. The following figure represents the External
+Sparkling Water cluster.
+
+
+.. figure:: ../images/external_backend.png
+ :alt: External Sparkling Water Cluster Topology
+
+
+To read more about the backends, please visit `Sparkling Water
+Backends <../tutorials/backends.rst>`__.
+
+When H2O services are running, it is possible to create H2O data
+structures, call H2O algorithms, and transfer values from/to RDD.
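+
+As an illustrative sketch only (the input ``DataFrame`` and the
+``"label"`` response column are assumptions, not part of this document),
+a typical flow converts Spark data and calls an H2O algorithm from the
+``hex`` package:
+
+.. code:: scala
+
+    import org.apache.spark.h2o._
+    import hex.deeplearning.DeepLearning
+    import hex.deeplearning.DeepLearningModel.DeepLearningParameters
+
+    val hc = H2OContext.getOrCreate(sc)
+    // Convert a Spark DataFrame (assumed to exist) into an H2OFrame
+    val train: H2OFrame = hc.asH2OFrame(trainingDF)
+
+    // Configure and launch the DeepLearning algorithm
+    val params = new DeepLearningParameters()
+    params._train = train                // frame key via implicit conversion
+    params._response_column = "label"    // assumed response column
+    val model = new DeepLearning(params).trainModel().get()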
+
+More Materials
+~~~~~~~~~~~~~~
+
+To read more about the Sparkling Water design, you can visit one of the
+links below:
+
+- `Basic Primitives `__
+- `Supported Platforms `__
+- `Spark Frame - H2O Frame Mapping `__
+- `Data Sharing Between H2O and Spark `__
+- `Supported Data Sources `__
+- `Supported Data Formats `__
diff --git a/doc/design/spark_h2o_mapping.rst b/doc/design/spark_h2o_mapping.rst
new file mode 100644
index 000000000..1d0556724
--- /dev/null
+++ b/doc/design/spark_h2o_mapping.rst
@@ -0,0 +1,73 @@
+Spark - H2O Frame Mapping
+-------------------------
+
+Type mapping between H2O H2OFrame types and Spark DataFrame types
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For all primitive Scala types or Spark SQL types (see
+``org.apache.spark.sql.types``) which can be part of a Spark
+RDD/DataFrame, we provide a mapping into H2O vector types (numeric,
+categorical, string, time, UUID - see ``water.fvec.Vec``):
+
++----------------------+-----------------+------------+
+| Scala type           | SQL type        | H2O type   |
++======================+=================+============+
+| *NA*                 | BinaryType      | Numeric    |
++----------------------+-----------------+------------+
+| Byte                 | ByteType        | Numeric    |
++----------------------+-----------------+------------+
+| Short                | ShortType       | Numeric    |
++----------------------+-----------------+------------+
+| Integer              | IntegerType     | Numeric    |
++----------------------+-----------------+------------+
+| Long                 | LongType        | Numeric    |
++----------------------+-----------------+------------+
+| Float                | FloatType       | Numeric    |
++----------------------+-----------------+------------+
+| Double               | DoubleType      | Numeric    |
++----------------------+-----------------+------------+
+| String               | StringType      | String     |
++----------------------+-----------------+------------+
+| Boolean              | BooleanType     | Numeric    |
++----------------------+-----------------+------------+
+| java.sql.Timestamp   | TimestampType   | Time       |
++----------------------+-----------------+------------+
+
+--------------
+
+Type mapping between H2O H2OFrame types and RDD[T] types
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As the type ``T``, the following types are supported:
+
++--------------------------------------------------+
+| T                                                |
++==================================================+
+| *NA*                                             |
++--------------------------------------------------+
+| Byte                                             |
++--------------------------------------------------+
+| Short                                            |
++--------------------------------------------------+
+| Integer                                          |
++--------------------------------------------------+
+| Long                                             |
++--------------------------------------------------+
+| Float                                            |
++--------------------------------------------------+
+| Double                                           |
++--------------------------------------------------+
+| String                                           |
++--------------------------------------------------+
+| Boolean                                          |
++--------------------------------------------------+
+| java.sql.Timestamp                               |
++--------------------------------------------------+
+| Any scala class extending scala ``Product``      |
++--------------------------------------------------+
+| org.apache.spark.mllib.regression.LabeledPoint   |
++--------------------------------------------------+
+
+As specified in the table, Sparkling Water supports transforming any
+Scala class extending ``Product``, which includes, for example, all
+case classes.
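+
+For example, a case class (the ``Person`` class below is hypothetical)
+can be converted directly; this is an illustrative sketch assuming a
+running ``H2OContext`` named ``hc``:
+
+.. code:: scala
+
+    import org.apache.spark.h2o._
+
+    case class Person(name: String, age: Int)
+
+    val rdd = sc.parallelize(Seq(Person("Alice", 34), Person("Bob", 36)))
+    // "name" becomes a String column and "age" a Numeric column
+    val hf: H2OFrame = hc.asH2OFrame(rdd)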
diff --git a/doc/design/supported_data_formats.rst b/doc/design/supported_data_formats.rst
new file mode 100644
index 000000000..9bb2f5006
--- /dev/null
+++ b/doc/design/supported_data_formats.rst
@@ -0,0 +1,8 @@
+Supported Data Formats
+----------------------
+
+Sparkling Water can read data stored in the following formats:
+
+- CSV
+- SVMLight
+- ARFF
diff --git a/doc/design/supported_data_sources.rst b/doc/design/supported_data_sources.rst
new file mode 100644
index 000000000..44ffd7f64
--- /dev/null
+++ b/doc/design/supported_data_sources.rst
@@ -0,0 +1,11 @@
+Supported Data Sources
+----------------------
+
+Currently, Sparkling Water can use the following data source types:
+
+- Standard RDD API to load data and transform them into ``H2OFrame``
+- H2O API to load data directly into ``H2OFrame`` from:
+
+ - local file(s)
+ - HDFS file(s)
+ - S3 file(s)
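+
+As an illustrative sketch, loading a file directly into an ``H2OFrame``
+via the H2O API (the file path is hypothetical):
+
+.. code:: scala
+
+    import java.io.File
+    import org.apache.spark.h2o._
+
+    // Parses the file (a CSV in this example) directly into an H2OFrame
+    val hf = new H2OFrame(new File("/data/iris.csv"))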
diff --git a/doc/design/supported_platforms.rst b/doc/design/supported_platforms.rst
new file mode 100644
index 000000000..11a2459b2
--- /dev/null
+++ b/doc/design/supported_platforms.rst
@@ -0,0 +1,61 @@
+Supported platforms
+-------------------
+
+Sparkling Water can run on top of Spark in various ways; however,
+starting Sparkling Water requires a different configuration in different
+environments:
+
+Local
+~~~~~
+
+In this case, Sparkling Water runs as a local cluster (the Spark master
+variable points to one of the values ``local``, ``local[*]``, or
+``local-cluster[...]``).
+
+Standalone Spark Cluster
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+`Spark documentation - running Standalone
+cluster `__
+
+YARN
+~~~~
+
+`Spark documentation - running Spark Application on
+YARN `__
+
+When submitting a Sparkling Water application to a CDH or Apache Hadoop
+cluster, the command to submit may look like:
+
+.. code:: bash
+
+    ./spark-submit --master=yarn-client --class water.SparklingWaterDriver --conf "spark.yarn.am.extraJavaOptions=-XX:MaxPermSize=384m"
+    --driver-memory=8G --num-executors=3 --executor-memory=3G --conf "spark.executor.extraClassPath=-XX:MaxPermSize=384m"
+    sparkling-water-assembly-2.1.9-all.jar
+
+When submitting a Sparkling Water application to an HDP cluster, the
+command to submit may look like:
+
+.. code:: bash
+
+    ./spark-submit --master=yarn-client --class water.SparklingWaterDriver --conf "spark.yarn.am.extraJavaOptions=-XX:MaxPermSize=384m -Dhdp.version=current" \
+    --driver-memory=8G --num-executors=3 --executor-memory=3G --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=384m -Dhdp.version=current" \
+    sparkling-water-assembly-2.1.9-all.jar
+
+Apart from the typical Spark configuration, it is necessary to add
+``-XX:MaxPermSize=384m`` (or higher; 384m is the minimum) to both the
+``spark.executor.extraJavaOptions`` and ``spark.yarn.am.extraJavaOptions``
+(client mode) or ``spark.driver.extraJavaOptions`` (cluster mode)
+configuration properties in order to run Sparkling Water correctly.
+
+The only difference between an HDP cluster and CDH or Apache Hadoop
+clusters is that in the HDP case we also need to add ``-Dhdp.version=current``
+to the ``spark.executor.extraJavaOptions`` and ``spark.yarn.am.extraJavaOptions``
+(resp. ``spark.driver.extraJavaOptions``) configuration properties.
+
+Mesos
+~~~~~
+
+`Spark documentation - running Spark Application on
+Mesos `__
diff --git a/doc/build.rst b/doc/devel/build.rst
similarity index 85%
rename from doc/build.rst
rename to doc/devel/build.rst
index 4e647e43d..5713c932d 100644
--- a/doc/build.rst
+++ b/doc/devel/build.rst
@@ -7,9 +7,10 @@ Download Spark installation and point environment variable
Then use the provided ``gradlew`` to build project:
In order to build the whole project including PySparkling, one of the
-following properties needs to be set: \* ``H2O_HOME`` - should point to
-location of the local H2O project directory \* ``H2O_PYTHON_WHEEL`` -
-should point to H2O Python Wheel.
+following properties needs to be set:
+
+- ``H2O_HOME`` - should point to location of the local H2O project directory.
+- ``H2O_PYTHON_WHEEL`` - should point to H2O Python Wheel.
If you are not sure which property to set, just run
diff --git a/doc/devel/devel.rst b/doc/devel/devel.rst
new file mode 100644
index 000000000..cb9033b18
--- /dev/null
+++ b/doc/devel/devel.rst
@@ -0,0 +1,7 @@
+Development
+-----------
+
+- `Building `__
+- `Running Sparkling Water Examples `__
+- `Running Unit Tests `__
+- `Integration Tests `__
diff --git a/doc/devel/integ_tests.rst b/doc/devel/integ_tests.rst
new file mode 100644
index 000000000..db0d7a0e6
--- /dev/null
+++ b/doc/devel/integ_tests.rst
@@ -0,0 +1,210 @@
+Integration Tests
+-----------------
+
+This document describes what is covered by the integration tests of the
+Sparkling Water project.
+
+The tests are performed for both Sparkling Water backend types. Please see `Sparkling Water Backends <../tutorials/backends.rst>`__
+for more information about the backends.
+
+Quick links:
+
+- `Testing Environments`_
+- `Testing Scenarios`_
+- `Integration Tests Example`_
+
+Testing Environments
+--------------------
+
+- Local - corresponds to setting the Spark ``MASTER`` variable to one of
+  the ``local``, ``local[*]``, or ``local-cluster[_,_,_]`` values
+- Standalone cluster - the ``MASTER`` variable points to an existing
+  standalone Spark cluster ``spark://...``
+- YARN cluster - the ``MASTER`` variable contains the ``yarn-client`` or
+  ``yarn-cluster`` value
+
+--------------
+
+Testing Scenarios
+-----------------
+
+1. Initialize H2O on top of Spark by running
+ ``H2OContext.getOrCreate(spark)`` and verifying that H2O was properly
+ initialized.
+2. Load data with help from the H2O API from various data sources:
+
+   - local disk
+   - HDFS
+   - S3N
+
+3. Convert from ``RDD[T]`` to ``H2OFrame``
+4. Convert from ``DataFrame`` to ``H2OFrame``
+5. Convert from ``H2OFrame`` to ``RDD``
+6. Convert from ``H2OFrame`` to ``DataFrame``
+7. Integrate with H2O Algorithms using RDD as algorithm input
+8. Integrate with MLlib Algorithms using H2OFrame as algorithm input
+ (KMeans)
+9. Integrate with MLlib pipelines
+
+--------------
+
+Integration Tests Example
+-------------------------
+
+The following code reflects the use cases listed above (Spark 2.0+ is
+required). The code is executed in all of the testing environments, where
+applicable:
+
+- local
+- standalone cluster
+- YARN
+
+1. Initialize H2O:
+
+.. code:: scala
+
+ import org.apache.spark.sql.SparkSession
+ val spark = SparkSession.builder().getOrCreate()
+ import org.apache.spark.h2o._
+ val h2oContext = H2OContext.getOrCreate(spark)
+ import h2oContext.implicits._
+
+2. Load data:
+
+- From the local disk:
+
+ .. code:: scala
+
+ import org.apache.spark.sql.SparkSession
+ val spark = SparkSession.builder().getOrCreate()
+ import org.apache.spark.h2o._
+ val h2oContext = H2OContext.getOrCreate(spark)
+
+ import java.io.File
+ val df: H2OFrame = new H2OFrame(new File("examples/smalldata/allyears2k_headers.csv.gz"))
+
+  Note: The file must be present on all nodes. In the case of the Sparkling Water internal backend, this means all nodes
+  running Spark; in the case of the external backend, all nodes running H2O.
+
+
+- From HDFS:
+
+ .. code:: scala
+
+ import org.apache.spark.sql.SparkSession
+ val spark = SparkSession.builder().getOrCreate()
+ import org.apache.spark.h2o._
+ val h2oContext = H2OContext.getOrCreate(spark)
+
+ val path = "hdfs://mr-0xd6.0xdata.loc/datasets/airlines_all.csv"
+ val uri = new java.net.URI(path)
+ val airlinesHF = new H2OFrame(uri)
+
+- From S3N:
+
+ .. code:: scala
+
+ import org.apache.spark.sql.SparkSession
+ val spark = SparkSession.builder().getOrCreate()
+ import org.apache.spark.h2o._
+ val h2oContext = H2OContext.getOrCreate(spark)
+
+ val path = "s3n://h2o-airlines-unpacked/allyears2k.csv"
+ val uri = new java.net.URI(path)
+ val airlinesHF = new H2OFrame(uri)
+
+  Note: Spark/H2O needs to know the AWS credentials specified in ``core-site.xml``. The credentials are passed via ``HADOOP_CONF_DIR``,
+  which points to a configuration directory containing ``core-site.xml``.
+
+3. Convert from ``RDD[T]`` to ``H2OFrame``:
+
+.. code:: scala
+
+ import org.apache.spark.sql.SparkSession
+ val spark = SparkSession.builder().getOrCreate()
+ import org.apache.spark.h2o._
+ val h2oContext = H2OContext.getOrCreate(spark)
+
+ val rdd = sc.parallelize(1 to 1000, 100).map( v => IntHolder(Some(v)))
+ val hf: H2OFrame = h2oContext.asH2OFrame(rdd)
+
+4. Convert from ``DataFrame`` to ``H2OFrame``:
+
+.. code:: scala
+
+ import org.apache.spark.sql.SparkSession
+ val spark = SparkSession.builder().getOrCreate()
+ import org.apache.spark.h2o._
+ val h2oContext = H2OContext.getOrCreate(spark)
+
+ import spark.implicits._
+ val df = spark.sparkContext.parallelize(1 to 1000, 100).map(v => IntHolder(Some(v))).toDF
+ val hf = h2oContext.asH2OFrame(df)
+
+5. Convert from ``H2OFrame`` to ``RDD[T]``:
+
+.. code:: scala
+
+ import org.apache.spark.sql.SparkSession
+ val spark = SparkSession.builder().getOrCreate()
+ import org.apache.spark.h2o._
+ val h2oContext = H2OContext.getOrCreate(spark)
+
+ val rdd = spark.sparkContext.parallelize(1 to 1000, 100).map(v => IntHolder(Some(v)))
+ val hf: H2OFrame = h2oContext.asH2OFrame(rdd)
+ val newRdd = h2oContext.asRDD[IntHolder](hf)
+
+6. Convert from ``H2OFrame`` to ``DataFrame``:
+
+.. code:: scala
+
+ import org.apache.spark.sql.SparkSession
+ val spark = SparkSession.builder().getOrCreate()
+ import org.apache.spark.h2o._
+ val h2oContext = H2OContext.getOrCreate(spark)
+
+ import spark.implicits._
+ val df = spark.sparkContext.parallelize(1 to 1000, 100).map(v => IntHolder(Some(v))).toDF
+ val hf = h2oContext.asH2OFrame(df)
+ val newRdd = h2oContext.asDataFrame(hf)(spark.sqlContext)
+
+7. Integrate with H2O Algorithms using RDD as algorithm input:
+
+.. code:: scala
+
+ import org.apache.spark.sql.SparkSession
+ val spark = SparkSession.builder().getOrCreate()
+ import org.apache.spark.h2o._
+ val h2oContext = H2OContext.getOrCreate(spark)
+ import h2oContext.implicits._
+ import org.apache.spark.examples.h2o._
+
+ val path = "examples/smalldata/prostate.csv"
+ val prostateText = spark.sparkContext.textFile(path)
+ val prostateRDD = prostateText.map(_.split(",")).map(row => ProstateParse(row))
+ import _root_.hex.tree.gbm.GBM
+ import _root_.hex.tree.gbm.GBMModel.GBMParameters
+ val train: H2OFrame = prostateRDD
+ val gbmParams = new GBMParameters()
+ gbmParams._train = train
+ gbmParams._response_column = "CAPSULE"
+ gbmParams._ntrees = 10
+ val gbmModel = new GBM(gbmParams).trainModel.get
+
+8. Integrate with MLlib algorithms:
+
+.. code:: scala
+
+ import org.apache.spark.sql.SparkSession
+ val spark = SparkSession.builder().getOrCreate()
+ import org.apache.spark.h2o._
+ val h2oContext = H2OContext.getOrCreate(spark)
+ import org.apache.spark.examples.h2o._
+
+ import java.io.File
+ val path = "examples/smalldata/prostate.csv"
+ val prostateHF = new H2OFrame(new File(path))
+ val prostateRDD = h2oContext.asRDD[Prostate](prostateHF)
+ import org.apache.spark.mllib.clustering.KMeans
+ import org.apache.spark.mllib.linalg.Vectors
+ val train = prostateRDD.map( v => Vectors.dense(v.CAPSULE.get*1.0, v.AGE.get*1.0, v.DPROS.get*1.0,v.DCAPS.get*1.0, v.GLEASON.get*1.0))
+ val clusters = KMeans.train(train, 5, 20)
\ No newline at end of file
diff --git a/doc/devel/running_examples.rst b/doc/devel/running_examples.rst
new file mode 100644
index 000000000..82e03b9d2
--- /dev/null
+++ b/doc/devel/running_examples.rst
@@ -0,0 +1,66 @@
+Running Sparkling Water Examples
+--------------------------------
+
+The Sparkling Water distribution also includes a set of examples. You
+can find their implementation in the `examples directory <../../examples/>`__. You
+can build and run them in the following way:
+
+1. Build a package that can be submitted to Spark cluster:
+
+ .. code:: bash
+
+ ./gradlew build -x check
+
+2. Set the configuration of the demo Spark cluster (for example, ``local[*]`` or ``local-cluster[3,2,1024]``)
+
+ .. code:: bash
+
+ export SPARK_HOME="/path/to/spark/installation"
+ export MASTER="local[*]"
+
+   In this example, the ``local[*]`` value causes creation of a single-node local cluster.
+
+
+3. And run the example:
+
+- On Local Cluster:
+
+  The cluster is defined by the ``MASTER`` address ``local-cluster[3,2,3072]``, which means that the cluster contains 3 worker nodes, each with 2 CPU cores and 3 GB of memory:
+
+ .. code:: bash
+
+ bin/run-example.sh
+
+- On a Spark Standalone Cluster:
+
+ - Run the Spark cluster, for example via:
+
+ .. code:: bash
+
+ bin/launch-spark-cloud.sh
+
+
+ - Verify that Spark is running: The Spark UI on http://localhost:8080/ should show 3 worker nodes
+ - Export ``MASTER`` address of Spark master using:
+
+ .. code:: bash
+
+ export MASTER="spark://localhost:7077"
+
+
+ - Run example:
+
+ .. code:: bash
+
+ bin/run-example.sh
+
+
+ - Observe status of the application via Spark UI on http://localhost:8080/
+
+
+For more details about examples, please see the
+`README <../../examples/README.rst>`__ file in the `examples directory <../../examples/>`__.
+
+Additional Examples
+~~~~~~~~~~~~~~~~~~~
+Additional examples are available at `examples folder <../../examples/>`__.
diff --git a/doc/devel/unit_tests.rst b/doc/devel/unit_tests.rst
new file mode 100644
index 000000000..5dcb8bafc
--- /dev/null
+++ b/doc/devel/unit_tests.rst
@@ -0,0 +1,13 @@
+Running Unit Tests
+------------------
+
+To invoke tests, for example from IntelliJ IDEA, the following JVM
+option is required:
+
+- ``-Dspark.testing=true``
+
+To invoke unit tests from gradle, run:
+
+.. code:: shell
+
+ ./gradlew build -x integTest -x scriptTest
\ No newline at end of file
diff --git a/doc/images/external_backend.png b/doc/images/external_backend.png
new file mode 100644
index 000000000..86ca4a5d9
Binary files /dev/null and b/doc/images/external_backend.png differ
diff --git a/doc/images/internal_backend.png b/doc/images/internal_backend.png
new file mode 100644
index 000000000..a2e373455
Binary files /dev/null and b/doc/images/internal_backend.png differ
diff --git a/design-doc/images/DataShare.png b/doc/images/internal_backend_data_sharing.png
similarity index 100%
rename from design-doc/images/DataShare.png
rename to doc/images/internal_backend_data_sharing.png
diff --git a/doc/requirements.rst b/doc/requirements.rst
new file mode 100644
index 000000000..a9856ca29
--- /dev/null
+++ b/doc/requirements.rst
@@ -0,0 +1,8 @@
+Sparkling Water Requirements
+----------------------------
+
+- Linux/OS X/Windows
+- Java 7+
+- Python 2.6+ for the Python version of Sparkling Water (PySparkling)
+- `Spark 1.6+ `__ and ``SPARK_HOME`` shell variable must point to your local Spark installation
+
diff --git a/doc/rest_api/rest_api.rst b/doc/rest_api/rest_api.rst
new file mode 100644
index 000000000..465870b20
--- /dev/null
+++ b/doc/rest_api/rest_api.rst
@@ -0,0 +1,4 @@
+Sparkling Water REST API
+------------------------
+
+- `Scala Interpreter REST API `__
diff --git a/doc/rest_api/scala_interpreter_endpoints.rst b/doc/rest_api/scala_interpreter_endpoints.rst
new file mode 100644
index 000000000..1d5bf89bc
--- /dev/null
+++ b/doc/rest_api/scala_interpreter_endpoints.rst
@@ -0,0 +1,92 @@
+Scala interpreter available through REST API
+--------------------------------------------
+
+Basic design
+~~~~~~~~~~~~
+
+There is a simple pool of interpreters created at startup. Once a
+Scala interpreter is associated with a session, a new interpreter is
+added to the pool. Each interpreter is deleted when it has not been used
+for some fixed time.
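
The pooling behavior can be sketched as follows (a simplified, hypothetical model for illustration; the class, method names, and timeout value are invented and are not part of the actual server implementation):

```python
import time

IDLE_TIMEOUT = 300.0  # seconds; illustrative value only

class InterpreterPool:
    """Toy model of the interpreter pool: sessions take interpreters
    from the pool, and idle interpreters expire after a fixed time."""

    def __init__(self, initial_size, clock=time.time):
        self.clock = clock
        self.free = [self._new() for _ in range(initial_size)]
        self.sessions = {}

    def _new(self):
        return {"last_used": self.clock()}

    def attach(self, session_id):
        # Associate a pooled interpreter with the session and refill the pool.
        interpreter = self.free.pop()
        self.sessions[session_id] = interpreter
        self.free.append(self._new())
        return interpreter

    def expire_idle(self):
        # Delete interpreters not used for IDLE_TIMEOUT seconds.
        now = self.clock()
        self.sessions = {sid: it for sid, it in self.sessions.items()
                         if now - it["last_used"] < IDLE_TIMEOUT}
```

For example, attaching a session consumes one free interpreter and immediately adds a fresh one, so the pool size stays constant.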
+
+Example usage of scala interpreter using REST API
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Here are some basic curl calls for accessing the Scala interpreter
+behind the REST API.
+
+Init interpreter and obtain session ID
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: bash
+
+ curl --data-urlencode '' http://192.168.0.10:54321/3/scalaint
+
+Destroy session and interpreter associated with the session
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: bash
+
+    curl -X DELETE \
+    http://192.168.0.10:54321/3/scalaint/512ef484-e21a-48f9-979e-2879f63a779e
+
+Get all active sessions
+^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: bash
+
+ curl http://192.168.0.10:54321/3/scalaint
+
+Interpret the incomplete code, status is error
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: bash
+
+    curl --data-urlencode code='sc.' \
+    http://192.168.0.10:54321/3/scalaint/c3e5ea38-0b7e-4136-9ba3-21615ea2d298
+
+Try to interpret the code, status is error (function does not exist)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: bash
+
+    curl --data-urlencode code='foo()' \
+    http://192.168.0.10:54321/3/scalaint/c3e5ea38-0b7e-4136-9ba3-21615ea2d298
+
+Interpret the code, result is success
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: bash
+
+    curl --data-urlencode code='21+21' \
+    http://192.168.0.10:54321/3/scalaint/c3e5ea38-0b7e-4136-9ba3-21615ea2d298
+
+Interpret the code with the spark context, use semicolon to separate commands
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: bash
+
+    curl --data-urlencode code='val data = Array(1, 2, 3, 4, 5); val distData = sc.parallelize(data); val result = distData.map(s => s+10)' \
+    http://192.168.0.10:54321/3/scalaint/c3e5ea38-0b7e-4136-9ba3-21615ea2d298
+
+Interpret the code with the spark context, use new lines to separate commands
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: bash
+
+    curl --data-urlencode code='val data = Array(1, 2, 3, 4, 5)
+    val distData = sc.parallelize(data)
+    val result = distData.map(s => s+10)' \
+    http://192.168.0.10:54321/3/scalaint/c3e5ea38-0b7e-4136-9ba3-21615ea2d298
+
+Declare class and use it in the next call
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: bash
+
+    curl --data-urlencode code='case class A(number: Int)' \
+    http://192.168.0.10:54321/3/scalaint/c3e5ea38-0b7e-4136-9ba3-21615ea2d298
+
+    curl --data-urlencode code='val data = Array(1, 2, 3, 4, 5)
+    val distData = sc.parallelize(data)
+    val result = distData.map(s => A(s))' \
+    http://192.168.0.10:54321/3/scalaint/c3e5ea38-0b7e-4136-9ba3-21615ea2d298
diff --git a/doc/running_examples.rst b/doc/running_examples.rst
deleted file mode 100644
index e6c19e8ec..000000000
--- a/doc/running_examples.rst
+++ /dev/null
@@ -1,36 +0,0 @@
-Running Sparkling Water examples
---------------------------------
-
-The Sparkling Water distribution includes also a set of examples. You
-can find their implementation in `examples directory `__. You
-can run them in the following way:
-
-1. Build a package that can be submitted to Spark cluster:
-
- .. code:: bash
-
- ./gradlew build -x check
-
-2. Set the configuration of the demo Spark cluster (for example,
- ``local[*]`` or ``local-cluster[3,2,1024]``)
-
- .. code:: bash
-
- export SPARK_HOME="/path/to/spark/installation"
- export MASTER="local[*]"
-
- In this example, the description ``local[*]`` causes creation of a single node local cluster.
-
-3. And run the example:
-
- .. code:: bash
-
- bin/run-example.sh
-
-
-For more details about examples, please see the
-`README `__ file in the `examples directory `__.
-
-Additional Examples
-~~~~~~~~~~~~~~~~~~~
-Additional examples are available at `examples folder `__.
diff --git a/doc/backends.rst b/doc/tutorials/backends.rst
similarity index 99%
rename from doc/backends.rst
rename to doc/tutorials/backends.rst
index ae4ab8aeb..39c686c33 100644
--- a/doc/backends.rst
+++ b/doc/tutorials/backends.rst
@@ -1,3 +1,6 @@
+Sparkling Water Backends
+========================
+
Internal backend
----------------
diff --git a/doc/tutorials/calling_h2o_algos.rst b/doc/tutorials/calling_h2o_algos.rst
new file mode 100644
index 000000000..a0a3d3d29
--- /dev/null
+++ b/doc/tutorials/calling_h2o_algos.rst
@@ -0,0 +1,29 @@
+Calling H2O Algorithms
+----------------------
+
+1. Create the parameters object that holds references to input data and
+ parameters specific for the algorithm:
+
+.. code:: scala
+
+ val train: RDD = ...
+ val valid: H2OFrame = ...
+
+ val gbmParams = new GBMParameters()
+ gbmParams._train = train
+ gbmParams._valid = valid
+ gbmParams._response_column = "bikes"
+ gbmParams._ntrees = 500
+ gbmParams._max_depth = 6
+
+2. Create a model builder:
+
+.. code:: scala
+
+ val gbm = new GBM(gbmParams)
+
+3. Invoke the model build job and block until the end of computation (``trainModel`` is an asynchronous call by default):
+
+.. code:: scala
+
+ val gbmModel = gbm.trainModel.get
diff --git a/doc/tutorials/change_log_level.rst b/doc/tutorials/change_log_level.rst
new file mode 100644
index 000000000..53a3e1397
--- /dev/null
+++ b/doc/tutorials/change_log_level.rst
@@ -0,0 +1,29 @@
+Change Sparkling Shell Logging Level
+------------------------------------
+
+By default, the console output for the Sparkling Water Shell shows verbose
+Spark output as well as H2O logs. If you would like to restrict the output
+to warnings from Spark only, you need to change this in the ``log4j.properties``
+file in Spark's configuration directory. To do this:
+
+.. code:: shell
+
+ cd $SPARK_HOME/conf
+ cp log4j.properties.template log4j.properties
+
+Then, in a text editor, change the contents of the
+``log4j.properties`` file from:
+
+.. code:: shell
+
+ #Set everything to be logged to the console
+ log4j.rootCategory=INFO, console
+ ...
+
+to:
+
+.. code:: shell
+
+ #Set everything to be logged to the console
+ log4j.rootCategory=WARN, console
+ ...
diff --git a/doc/tutorials/h2o_frame_from_key.rst b/doc/tutorials/h2o_frame_from_key.rst
new file mode 100644
index 000000000..4af9f5ad0
--- /dev/null
+++ b/doc/tutorials/h2o_frame_from_key.rst
@@ -0,0 +1,11 @@
+Creating H2OFrame from an Existing Key
+--------------------------------------
+
+If the H2O cluster already contains a loaded ``H2OFrame`` referenced by
+the key ``train.hex``, it is possible to reference it from Sparkling
+Water by creating a proxy ``H2OFrame`` instance using the key as the
+input:
+
+.. code:: scala
+
+ val trainHF = new H2OFrame("train.hex")
diff --git a/doc/tutorials/h2oframe_as_data_source.rst b/doc/tutorials/h2oframe_as_data_source.rst
new file mode 100644
index 000000000..6962f08cf
--- /dev/null
+++ b/doc/tutorials/h2oframe_as_data_source.rst
@@ -0,0 +1,164 @@
+H2O Frame as Spark's Data Source
+--------------------------------
+
+The way an H2O Frame can be used as Spark's data source differs
+slightly between Python and Scala.
+
+Quick links:
+
+- `Usage in Python - PySparkling`_
+- `Usage in Scala`_
+- `Specifying Saving Mode`_
+
+Usage in Python - PySparkling
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Reading from H2O Frame
+^^^^^^^^^^^^^^^^^^^^^^
+
+Let's suppose we have an H2OFrame ``frame``.
+
+There are two ways to load a Spark DataFrame from an H2OFrame in
+PySparkling:
+
+.. code:: python
+
+ df = spark.read.format("h2o").option("key", frame.frame_id).load()
+
+or
+
+.. code:: python
+
+ df = spark.read.format("h2o").load(frame.frame_id)
+
+Saving to H2O Frame
+^^^^^^^^^^^^^^^^^^^
+
+Let's suppose we have a DataFrame ``df``.
+
+There are two ways to save a DataFrame as an H2OFrame in
+PySparkling:
+
+.. code:: python
+
+ df.write.format("h2o").option("key", "new_key").save()
+
+or
+
+.. code:: python
+
+ df.write.format("h2o").save("new_key")
+
+Both variants save the DataFrame as an H2OFrame with the key ``new_key``. They
+fail if an H2OFrame with the same key already exists.
+
+Loading & Saving Options
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+If the key is specified both as the ``key`` option and in the ``load/save``
+method, the ``key`` option is preferred:
+
+.. code:: python
+
+    df = spark.read.format("h2o").option("key", "key_one").load("key_two")
+
+or
+
+.. code:: python
+
+    df.write.format("h2o").option("key", "key_one").save("key_two")
+
+In both examples, ``key_one`` is used.
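
The precedence rule can be captured in a small sketch (``resolve_key`` is a hypothetical helper that mirrors the documented behavior; it is not part of the PySparkling API):

```python
def resolve_key(option_key, path_key):
    """Return the H2OFrame key, preferring the `key` option over the
    key passed directly to load()/save(), as described above."""
    return option_key if option_key is not None else path_key

# The `key` option wins when both are given:
assert resolve_key("key_one", "key_two") == "key_one"
# Otherwise the load()/save() argument is used:
assert resolve_key(None, "key_two") == "key_two"
```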
+
+Usage in Scala
+~~~~~~~~~~~~~~
+
+Reading from H2O Frame
+^^^^^^^^^^^^^^^^^^^^^^
+
+Let's suppose we have an H2OFrame ``frame``.
+
+The shortest way to load a DataFrame from an H2OFrame with default
+settings is:
+
+.. code:: scala
+
+ val df = spark.read.h2o(frame.key)
+
+There are two more ways to load a DataFrame from an H2OFrame. These calls allow
+us to specify additional options:
+
+.. code:: scala
+
+ val df = spark.read.format("h2o").option("key", frame.key.toString).load()
+
+or
+
+.. code:: scala
+
+ val df = spark.read.format("h2o").load(frame.key.toString)
+
+Saving to H2O Frame
+^^^^^^^^^^^^^^^^^^^
+
+Let's suppose we have a DataFrame ``df``.
+
+The shortest way to save a DataFrame as an H2O Frame with default
+settings is:
+
+.. code:: scala
+
+ df.write.h2o("new_key")
+
+There are two more ways to save a DataFrame as an H2OFrame. These calls allow
+us to specify additional options:
+
+.. code:: scala
+
+ df.write.format("h2o").option("key", "new_key").save()
+
+or
+
+.. code:: scala
+
+ df.write.format("h2o").save("new_key")
+
+All three variants save the DataFrame as an H2OFrame with the key ``new_key``. They
+fail if an H2OFrame with the same key already exists.
+
+Loading & Saving Options
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+If the key is specified both as the ``key`` option and in the ``load/save``
+method, the ``key`` option is preferred:
+
+.. code:: scala
+
+    val df = spark.read.format("h2o").option("key", "key_one").load("key_two")
+
+or
+
+.. code:: scala
+
+    df.write.format("h2o").option("key", "key_one").save("key_two")
+
+In both examples, ``key_one`` is used.
+
+Specifying Saving Mode
+~~~~~~~~~~~~~~~~~~~~~~
+
+There are four save modes available when saving data using the Data Source
+API: ``append``, ``overwrite``, ``error``, and ``ignore``. The full description is available at `Spark Save Modes `__.
+
+- If ``append`` mode is used, an existing H2OFrame with the same key is
+  deleted, and a new one containing the union of all rows from the original
+  H2OFrame and the appended DataFrame is created with the same key.
+
+- If ``overwrite`` mode is used, an existing H2OFrame with the same key is
+  deleted, and a new one with the new rows is created with the same key.
+
+- If ``error`` mode is used and an H2OFrame with the specified key already
+  exists, an exception is thrown.
+
+- If ``ignore`` mode is used and an H2OFrame with the specified key already
+  exists, no data is changed.
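
The semantics of the four modes can be illustrated against a toy in-memory key/value store (a sketch of the behavior described above; the ``store`` dict and ``save`` function are invented stand-ins, not the actual implementation):

```python
store = {}  # toy stand-in for the H2O K/V store: key -> list of rows

def save(key, rows, mode):
    """Mimic the four Data Source save modes on the toy store."""
    if mode == "append":
        rows = store.get(key, []) + rows  # union of original and new rows
        store[key] = rows                 # recreated under the same key
    elif mode == "overwrite":
        store[key] = rows                 # old frame replaced
    elif mode == "error":
        if key in store:
            raise ValueError("H2OFrame %s already exists" % key)
        store[key] = rows
    elif mode == "ignore":
        store.setdefault(key, rows)       # no change if the key exists
    else:
        raise ValueError("unknown mode: %s" % mode)
```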
diff --git a/doc/tutorials/obtaining_logs.rst b/doc/tutorials/obtaining_logs.rst
new file mode 100644
index 000000000..ad376c882
--- /dev/null
+++ b/doc/tutorials/obtaining_logs.rst
@@ -0,0 +1,34 @@
+Obtain Sparkling Water Logs
+---------------------------
+
+Depending on how you launched H2O, there are a couple of ways to obtain
+the logs.
+
+Logs for Sparkling Water on YARN
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When launching Sparkling Water on YARN, you can find the application ID
+of the YARN job in the resource manager, where you can also find the
+application master (which is also the Spark master). The following command
+prints the YARN logs to the console:
+
+.. code:: shell
+
+    yarn logs -applicationId <application_id>
+
+
+Logs for Standalone Sparkling Water
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+By default, the Spark environment variable ``SPARK_LOG_DIR`` is set to
+``$SPARK_HOME/work/``. To also log the configuration with which
+Spark is started, start Sparkling Water with the following
+configuration:
+
+.. code:: shell
+
+ bin/sparkling-shell.sh --conf spark.logConf=true
+
+The logs for a particular application are located at
+``$SPARK_HOME/work/<application_id>``. The directory also contains
+stdout and stderr for each node in the cluster.
diff --git a/doc/windows_manual.rst b/doc/tutorials/run_on_windows.rst
similarity index 96%
rename from doc/windows_manual.rst
rename to doc/tutorials/run_on_windows.rst
index c44a9a2df..5e46cab8b 100644
--- a/doc/windows_manual.rst
+++ b/doc/tutorials/run_on_windows.rst
@@ -1,4 +1,4 @@
-Use Sparkling Water in Windows environments
+Use Sparkling Water in Windows Environments
-------------------------------------------
The Windows environments require several additional steps to make Spark
diff --git a/doc/tutorials/run_sparkling_water.rst b/doc/tutorials/run_sparkling_water.rst
new file mode 100644
index 000000000..6c237f499
--- /dev/null
+++ b/doc/tutorials/run_sparkling_water.rst
@@ -0,0 +1,38 @@
+Running Sparkling Water
+-----------------------
+
+In order to run Sparkling Water, the environment must contain the
+``SPARK_HOME`` variable pointing to the Spark distribution.
+
+H2O on Spark can be started in the Spark shell or in a Spark
+application. The Sparkling Shell is launched as:
+
+.. code:: bash
+
+ ./bin/sparkling-shell
+
+Sparkling Water (H2O on Spark) can then be initialized using the following call:
+
+.. code:: scala
+
+ val hc = H2OContext.getOrCreate(spark)
+
+The semantics of the call depend on the configured Sparkling Water
+backend. For more information about the backends, please see `Sparkling
+Water Backends `__.
+
+In internal backend mode, the call will:
+
+1. Collect the number and host names of the executors (worker nodes) in the Spark cluster
+2. Launch H2O services on each detected executor
+3. Create a cloud for H2O services based on the list of executors
+4. Verify the H2O cloud status
+
+In external backend mode, the call will:
+
+1. Start H2O in client mode on the Spark driver
+2. Start a separate H2O cluster on the configured YARN queue
+3. Connect to the external cluster from the H2O client
+
+
+To see how to run Sparkling Water on Windows, please visit `Run on Windows `__.
\ No newline at end of file
diff --git a/doc/tutorials/security.rst b/doc/tutorials/security.rst
new file mode 100644
index 000000000..87c1065f8
--- /dev/null
+++ b/doc/tutorials/security.rst
@@ -0,0 +1,47 @@
+Enabling Security
+-----------------
+
+Both Spark and H2O support basic node authentication and data
+encryption. In H2O's case, we encrypt all the data sent between server
+nodes and between client and server nodes. This feature does not cover
+H2O's UDP communication; only data sent via TCP is encrypted.
+
+Currently, only encryption based on Java key pairs is supported (a more
+in-depth explanation can be found in H2O's documentation linked below).
+
+To enable security for Spark methods, please check `their
+documentation `__.
+
+Security for data exchanged between H2O instances can be enabled
+manually by generating all necessary files and distributing them to all
+worker nodes as described in `H2O's
+documentation `__
+and passing the ``spark.ext.h2o.internal_security_conf`` to spark
+submit:
+
+.. code:: shell
+
+ bin/sparkling-shell --conf "spark.ext.h2o.internal_security_conf=ssl.properties"
+
+We also provide utility methods which automatically generate all
+necessary files and enable security on all H2O nodes:
+
+.. code:: scala
+
+ import org.apache.spark.network.Security
+ import org.apache.spark.h2o._
+ Security.enableSSL(spark) // generate properties file, key pairs and set appropriate H2O parameters
+ val hc = H2OContext.getOrCreate(spark) // start the H2O cloud
+
+Or, if you plan on passing your own ``H2OConf``, please use:
+
+.. code:: scala
+
+ import org.apache.spark.network.Security
+ import org.apache.spark.h2o._
+ val conf: H2OConf = // generate H2OConf file
+ Security.enableSSL(spark, conf) // generate properties file, key pairs and set appropriate H2O parameters
+ val hc = H2OContext.getOrCreate(spark, conf) // start the H2O cloud
+
+This method generates all the necessary files and distributes them via YARN or
+Spark methods to all worker nodes. This communication is secure if YARN/Spark security is configured.
diff --git a/doc/tutorials/spark_h2o_conversions.rst b/doc/tutorials/spark_h2o_conversions.rst
new file mode 100644
index 000000000..cb57841ed
--- /dev/null
+++ b/doc/tutorials/spark_h2o_conversions.rst
@@ -0,0 +1,141 @@
+Spark Frame <--> H2O Frame Conversions
+--------------------------------------
+
+Quick links:
+
+- `Converting H2OFrame into RDD[T]`_
+- `Converting H2OFrame into DataFrame`_
+- `Converting RDD[T] into H2OFrame`_
+- `Converting DataFrame into H2OFrame`_
+
+Converting H2OFrame into RDD[T]
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``H2OContext`` class provides the explicit conversion, ``asRDD``,
+which creates an RDD-like wrapper around the provided H2OFrame:
+
+.. code:: scala
+
+ def asRDD[A <: Product: TypeTag: ClassTag](fr : H2OFrame) : RDD[A]
+
+The call expects the type ``A`` in order to create a correctly typed RDD. The
+conversion requires type ``A`` to be bound by the ``Product`` trait. The
+relationship between the columns of the H2OFrame and the attributes of class
+``A`` is based on name matching.
+
+Example
+^^^^^^^
+
+.. code:: scala
+
+ val df: H2OFrame = ...
+ val rdd = asRDD[Weather](df)
+
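The name-matching idea can be sketched in simplified form (a Python illustration with invented column and field names; the real conversion is the typed Scala call above):

```python
from collections import namedtuple

# A stand-in for a Product-like row type with named attributes:
Weather = namedtuple("Weather", ["temperature", "humidity"])

def frame_to_records(columns, row_type):
    """Build typed records by matching column names to the attribute
    names of `row_type` (column order in the frame is irrelevant)."""
    picked = [columns[name] for name in row_type._fields]
    return [row_type(*values) for values in zip(*picked)]

# Frame columns keyed by name; extra columns are simply ignored:
frame = {"humidity": [0.4, 0.5], "temperature": [21.0, 19.5], "station": [1, 2]}
records = frame_to_records(frame, Weather)
```
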
+--------------
+
+Converting H2OFrame into DataFrame
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``H2OContext`` class provides the explicit conversion,
+``asDataFrame``, which creates a DataFrame-like wrapper around the
+provided H2OFrame. Technically, it provides the ``RDD[sql.Row]`` RDD
+API:
+
+.. code:: scala
+
+ def asDataFrame(fr : H2OFrame)(implicit sqlContext: SQLContext) : DataFrame
+
+This call does not require any type parameters, but since it creates
+``DataFrame`` instances, it requires access to an instance of
+``SQLContext``. In this case, the instance is provided as an implicit
+parameter of the call. The parameter can be passed in two ways: as an
+explicit parameter or by introducing an implicit variable into the
+current context.
+
+The schema of the created ``DataFrame`` instance is derived from
+the column names and types of the specified ``H2OFrame``.
+
+Example
+^^^^^^^
+
+Using an explicit parameter in the call to pass the sqlContext:
+
+.. code:: scala
+
+ val sqlContext = new SQLContext(sc)
+ val schemaRDD = asDataFrame(h2oFrame)(sqlContext)
+
+or as an implicit variable provided by the current environment:
+
+.. code:: scala
+
+ implicit val sqlContext = new SQLContext(sc)
+ val schemaRDD = asDataFrame(h2oFrame)
+
+--------------
+
+Converting RDD[T] into H2OFrame
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``H2OContext`` provides **implicit** conversion from the specified
+``RDD[A]`` to ``H2OFrame``. As with conversion in the opposite
+direction, the type ``A`` has to satisfy the upper bound expressed by
+the type ``Product``. The conversion will create a new ``H2OFrame``,
+transfer data from the specified RDD, and save it to the H2O K/V data
+store.
+
+.. code:: scala
+
+ implicit def asH2OFrame[A <: Product : TypeTag](rdd : RDD[A]) : H2OFrame
+
+The API also provides an explicit version, which allows specifying a
+name for the resulting H2OFrame.
+
+.. code:: scala
+
+ def asH2OFrame[A <: Product : TypeTag](rdd : RDD[A], frameName: Option[String]) : H2OFrame
+
+Example
+^^^^^^^
+
+.. code:: scala
+
+ val rdd: RDD[Weather] = ...
+ import h2oContext.implicits._
+ // implicit call of H2OContext.asH2OFrame[Weather](rdd) is used
+ val hf: H2OFrame = rdd
+ // Explicit call of the H2OContext API with a name for the resulting H2O frame
+ val hfNamed: H2OFrame = h2oContext.asH2OFrame(rdd, Some("h2oframe"))
+
+--------------
+
+Converting DataFrame into H2OFrame
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``H2OContext`` provides **implicit** conversion from the specified
+``DataFrame`` to ``H2OFrame``. The conversion will create a new
+``H2OFrame``, transfer data from the specified ``DataFrame``, and save
+it to the H2O K/V data store.
+
+.. code:: scala
+
+ implicit def asH2OFrame(rdd : DataFrame) : H2OFrame
+
+The API also provides an explicit version, which allows specifying a
+name for the resulting H2OFrame.
+
+.. code:: scala
+
+ def asH2OFrame(rdd : DataFrame, frameName: Option[String]) : H2OFrame
+
+Example
+^^^^^^^
+
+.. code:: scala
+
+ val df: DataFrame = ...
+ import h2oContext.implicits._
+ // Implicit call of H2OContext.asH2OFrame(df) is used
+ val hf: H2OFrame = df
+ // Explicit call of the H2OContext API with a name for the resulting H2O frame
+ val hfNamed: H2OFrame = h2oContext.asH2OFrame(df, Some("h2oframe"))
diff --git a/doc/tutorials/tutorials.rst b/doc/tutorials/tutorials.rst
new file mode 100644
index 000000000..b3434c768
--- /dev/null
+++ b/doc/tutorials/tutorials.rst
@@ -0,0 +1,21 @@
+Tutorials
+---------
+
+- `Running Sparkling Water `__
+- `Running Sparkling Water on Windows `__
+- `Sparkling Water Backends Run & Configuration `__
+- `Enabling Security `__
+- `Calling H2O Algorithms `__
+- Frames Conversions & Creation
+
+ - `Spark - H2O Frame Conversions `__
+ - `H2O Frame as Spark's DataSource `__
+ - `Create H2OFrame From an Existing Key `__
+
+- Logging
+
+ - `Change Sparkling Shell Logging Level `__
+ - `Obtain Sparkling Water Logs `__
+
+- `Sparkling Water and Zeppelin `__
+- `Use Sparkling Water as Spark Package `__
diff --git a/doc/spark_package.rst b/doc/tutorials/use_as_spark_package.rst
similarity index 100%
rename from doc/spark_package.rst
rename to doc/tutorials/use_as_spark_package.rst
diff --git a/doc/tutorials/use_on_zeppelin.rst b/doc/tutorials/use_on_zeppelin.rst
new file mode 100644
index 000000000..741fdf199
--- /dev/null
+++ b/doc/tutorials/use_on_zeppelin.rst
@@ -0,0 +1,48 @@
+Sparkling Water and Zeppelin
+----------------------------
+
+Since Sparkling Water exposes a Scala API, it can be accessed directly
+from a Zeppelin notebook cell marked with the ``%spark`` tag.
+
+Launch Zeppelin with Sparkling Water
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Using Sparkling Water from Zeppelin is easy because Sparkling Water is
+distributed as a Spark package. In this case, an additional shell
+variable is needed before launching Zeppelin:
+
+.. code:: bash
+
+ export SPARK_HOME=... # Spark 2.1 home
+ export SPARK_SUBMIT_OPTIONS="--packages ai.h2o:sparkling-water-examples_2.11:2.1.0"
+ bin/zeppelin.sh -Pspark-2.1
+
+The command uses the Spark 2.1 version and the corresponding Sparkling
+Water package.
+
+Using Zeppelin
+~~~~~~~~~~~~~~
+
+The use of the Sparkling Water package is driven directly by the
+Sparkling Water API. For example, getting an ``H2OContext`` is
+straightforward:
+
+.. code:: scala
+
+ %spark
+ import org.apache.spark.h2o._
+ val hc = H2OContext.getOrCreate(spark)
+
+Creating ``H2OFrame`` from Spark ``DataFrame``:
+
+.. code:: scala
+
+ %spark
+ val df = sc.parallelize(1 to 1000).toDF
+ val hf = hc.asH2OFrame(df)
+
+Creating Spark ``DataFrame`` from ``H2OFrame``:
+
+.. code:: scala
+
+ %spark
+ val df = hc.asDataFrame(hf)
diff --git a/doc/typical_use_case.rst b/doc/typical_use_case.rst
new file mode 100644
index 000000000..3004cf270
--- /dev/null
+++ b/doc/typical_use_case.rst
@@ -0,0 +1,9 @@
+Typical Use Case
+----------------
+
+Sparkling Water excels at leveraging existing Spark-based workflows that
+need to call advanced machine learning algorithms. A typical example
+involves data munging with the help of the Spark API, where a prepared
+table is passed to the H2O DeepLearning algorithm. The constructed
+DeepLearning model estimates different metrics based on the testing
+data, which can be used in the rest of the Spark workflow.
diff --git a/docker/README.md b/docker/README.md
deleted file mode 100644
index 237313736..000000000
--- a/docker/README.md
+++ /dev/null
@@ -1,38 +0,0 @@
-# Docker Support
-
-## Create DockerFile
-
-Docker file can be created by calling ```./gradlew createDockerFile -PsparkVersion=version```, where ```version``` is spark version
-for which to generate docker file. Latest corresponding sparkling water release to be used within docker image is determined
-automatically based on spark version.
-
-The gradle task can be called without the parameter as ```./gradlew createDockerFile```, which creates docker file for spark version
-defined in gradle.properties file.
-
-## Container requirements
-To run Sparkling Water in the container, the host has to provide a machine with at least 5G of total memory.
-If this is not met, Sparkling Water scripts print warning but still attempt to run.
-
-## Building a container
-
-```
-$ cd docker && ./build.sh
-```
-
-## Run bash inside container
-
-```
-$ cd docker && docker run -i -t sparkling-water-base /bin/bash
-```
-
-## Run Sparkling Shell inside container
-
-```
-$ cd docker && docker run -i -t --rm sparkling-water-base bin/sparkling-shell
-```
-
-## Running examples in container
-
-```
-$ cd docker && docker run -i -t --rm sparkling-water-base bin/run-example.sh
-```
diff --git a/docker/README.rst b/docker/README.rst
new file mode 100644
index 000000000..b051b3996
--- /dev/null
+++ b/docker/README.rst
@@ -0,0 +1,50 @@
+Docker Support
+==============
+
+Create DockerFile
+-----------------
+
+A Dockerfile can be created by calling
+``./gradlew createDockerFile -PsparkVersion=version``, where ``version``
+is the Spark version for which to generate the Dockerfile. The latest
+corresponding Sparkling Water release to be used within the Docker image
+is determined automatically based on the Spark version.
+
+The Gradle task can also be called without the parameter, as
+``./gradlew createDockerFile``, which creates a Dockerfile for the Spark
+version defined in the gradle.properties file.
+
+Container requirements
+----------------------
+
+To run Sparkling Water in the container, the host has to provide a
+machine with at least 5G of total memory. If this requirement is not
+met, the Sparkling Water scripts print a warning but still attempt to
+run.
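+
+As an illustrative pre-flight check (this script is an assumption, not
+part of Sparkling Water; the 5G threshold mirrors the requirement
+above), total memory can be verified on Linux before starting the
+container:
+
+.. code:: bash
+
+ # Read total memory in kB from /proc/meminfo and compare against 5G
+ total_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
+ if [ "$total_kb" -lt $((5 * 1024 * 1024)) ]; then
+     echo "Warning: less than 5G of total memory available"
+ fi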
+
+Building a container
+--------------------
+
+.. code:: bash
+
+ $ cd docker && ./build.sh
+
+Run bash inside container
+-------------------------
+
+.. code:: bash
+
+ $ cd docker && docker run -i -t sparkling-water-base /bin/bash
+
+Run Sparkling Shell inside container
+------------------------------------
+
+.. code:: bash
+
+ $ cd docker && docker run -i -t --rm sparkling-water-base bin/sparkling-shell
+
+Running examples in container
+-----------------------------
+
+.. code:: bash
+
+ $ cd docker && docker run -i -t --rm sparkling-water-base bin/run-example.sh
diff --git a/examples/README.md b/examples/README.md
deleted file mode 100644
index a330ba3ff..000000000
--- a/examples/README.md
+++ /dev/null
@@ -1,206 +0,0 @@
-#Sparkling Water Table of Contents
-- [Compiling examples](#CompileExample)
-- [Running examples](#RunExample)
- - [Running on a local cluster](#LocalCluster)
- - [Running on a Spark cluster](#SparkCluster)
-- [Configuring variables](#ConfigVar)
-- [Step-by-step weather example](#WeatherExample)
-- [Running Sparkling Water on Hadoop](#Hadoop)
-- [Importing data from HDFS](#ImportData)
-
-
-# Sparkling Water Examples
-
-## Available Demos And Applications
- * [`CraigslistJobTitlesStreamingApp`](src/main/scala/org/apache/spark/examples/h2o/CraigslistJobTitlesStreamingApp.scala) - **stream** application - it predicts job category based on incoming job description
- * [`CraigslistJobTitlesApp`](src/main/scala/org/apache/spark/examples/h2o/CraigslistJobTitlesApp.scala) - predict job category based on posted job description
- * [`ChicagoCrimeAppSmall`](src/main/scala/org/apache/spark/examples/h2o/ChicagoCrimeAppSmall.scala) - builds a model predicting a probability of arrest for given crime in Chicago using data in [`smalldata` directory](smalldata/)
- * [`ChicagoCrimeApp`](src/main/scala/org/apache/spark/examples/h2o/ChicagoCrimeApp.scala) - implementation of Chicago Crime demo with setup for data stored on HDFS
- * [`CitiBikeSharingDemo`](src/main/scala/org/apache/spark/examples/h2o/CitiBikeSharingDemo.scala) - predicts occupancy of Citi bike stations in NYC
- * [`HamOrSpamDemo`](src/main/scala/org/apache/spark/examples/h2o/HamOrSpamDemo.scala) - shows Spam detector with Spark and H2O's DeepLearning
- * [`ProstateDemo`](src/main/scala/org/apache/spark/examples/h2o/ProstateDemo.scala) - running K-means on prostate dataset (see
- _smalldata/prostate.csv_)
- * [`DeepLearningDemo`](src/main/scala/org/apache/spark/examples/h2o/DeepLearningDemo.scala) - running DeepLearning on a subset of airlines dataset (see
- _smalldata/allyears2k\_headers.csv.gz_)
- * [`AirlinesWithWeatherDemo`](src/main/scala/org/apache/spark/examples/h2o/AirlinesWithWeatherDemo.scala) - joining flights data with weather data and running
- Deep Learning
- * [`AirlinesWithWeatherDemo2`](src/main/scala/org/apache/spark/examples/h2o/AirlinesWithWeatherDemo2.scala) - new iteration of `AirlinesWithWeatherDemo`
-
-
-> Run examples by typing `bin/run-example.sh ` or follow text below.
-
-## Available Demos for Sparkling Shell
- * [`chicagoCrimeSmallShell.script.scala`](scripts/chicagoCrimeSmallShell.script.scala) - demo showing full source code of predicting arrest probability for a given crime. It covers whole machine learning process from loading and transforming data, building models, scoring incoming events.
- * [`chicagoCrimeSmall.script.scala`](scripts/chicagoCrimeSmall.script.scala) - example of using [ChicagoCrimeApp](src/main/scala/org/apache/spark/examples/h2o/ChicagoCrimeApp.scala) - creating application and using it for scoring individual crime events.
- * [`mlconf_2015_hamSpam.script.scala`](scripts/mlconf_2015_hamSpam.script.scala) - HamOrSpam application which detectes Spam messages. Presented at MLConf 2015 NYC.
- * [`strata2015_demo.scala`](scripts/strata2015_demo.scala) - NYC CitiBike demo presented at Strata 2015 in San Jose.
- * [`StrataAirlines.scala`](scripts/StrataAirlines.scala) - example of using flights and weather data to predict delay of a flight
-
-> Run examples by typing `bin/sparkling-shell -i `
-
------
-
-
-## Compiling Examples
-To compile, use top-level `gradlew`:
-```
-./gradlew assemble
-```
----
-
-## Running Examples
-
-
-### On a Simple Local Cluster
-
- Run a given example on local cluster. The cluster is defined by `MASTER` address `local-cluster[3,2,3072]` which means that cluster contains 3 worker nodes, each having 2CPUs and 3GB of memory:
- * Run `bin/run-example.sh `
-
----
-
-### On a Spark Cluster
- * Run the Spark cluster, for example via `bin/launch-spark-cloud.sh`
- * Verify that Spark is running: The Spark UI on `http://localhost:8080/` should show 3 worker nodes
- * Export `MASTER` address of Spark master using `export MASTER="spark://localhost:7077"`
- * Run `bin/run-example.sh `
- * Observe status of the application via Spark UI on `http://localhost:8080/`
-
----
-
-## Configuring Sparkling Water Variables
-
-You can configure Sparkling Water using the following variables:
- * `spark.h2o.cloud.timeout` - number of milliseconds to wait for cloud formation
- * `spark.h2o.workers` - number of expected H2O workers; it should be same as number of Spark workers
- * `spark.h2o.preserve.executors` - do not kill executors via calling `sc.stop()` call
-
----
-
-## Step-by-Step Weather Data Example
-
-1. Run Sparkling shell with an embedded cluster:
- ```
- export SPARK_HOME="/path/to/spark/installation"
- export MASTER="local[*]"
- bin/sparkling-shell
- ```
-
-2. To see the Sparkling shell (i.e., Spark driver) status, go to [http://localhost:4040/](http://localhost:4040/).
-
-3. Initialize H2O services on top of Spark cluster:
- ```scala
- import org.apache.spark.h2o._
- val h2oContext = H2OContext.getOrCreate(sc)
- import h2oContext._
- import h2oContext.implicits._
- ```
-
-4. Load weather data for Chicago international airport (ORD), with help from the RDD API:
- ```scala
- import org.apache.spark.examples.h2o._
- val weatherDataFile = "examples/smalldata/Chicago_Ohare_International_Airport.csv"
- val wrawdata = sc.textFile(weatherDataFile,3).cache()
- val weatherTable = wrawdata.map(_.split(",")).map(row => WeatherParse(row)).filter(!_.isWrongRow())
- ```
-
-5. Load airlines data using the H2O parser:
- ```scala
- import java.io.File
- val dataFile = "examples/smalldata/allyears2k_headers.csv.gz"
- val airlinesData = new H2OFrame(new File(dataFile))
- ```
-
-6. Select flights destined for Chicago (ORD):
- ```scala
- val airlinesTable : RDD[Airlines] = asRDD[Airlines](airlinesData)
- val flightsToORD = airlinesTable.filter(f => f.Dest==Some("ORD"))
- ```
-
-7. Compute the number of these flights:
- ```scala
- flightsToORD.count
- ```
-
-8. Use Spark SQL to join the flight data with the weather data:
- ```scala
- implicit val sqlContext = spark.sqlContext
- import sqlContext.implicits._
- flightsToORD.toDF.createOrReplaceTempView("FlightsToORD")
- weatherTable.toDF.createOrReplaceTempView("WeatherORD")
- ```
-
-9. Perform SQL JOIN on both tables:
- ```scala
- val bigTable = sqlContext.sql(
- """SELECT
- |f.Year,f.Month,f.DayofMonth,
- |f.CRSDepTime,f.CRSArrTime,f.CRSElapsedTime,
- |f.UniqueCarrier,f.FlightNum,f.TailNum,
- |f.Origin,f.Distance,
- |w.TmaxF,w.TminF,w.TmeanF,w.PrcpIn,w.SnowIn,w.CDD,w.HDD,w.GDD,
- |f.ArrDelay
- |FROM FlightsToORD f
- |JOIN WeatherORD w
- |ON f.Year=w.Year AND f.Month=w.Month AND f.DayofMonth=w.Day""".stripMargin)
- ```
-
-10. Transform the first 3 columns containing date information into enum columns:
- ```scala
- val bigDataFrame: H2OFrame = h2oContext.asH2OFrame(bigTable)
- for( i <- 0 to 2) bigDataFrame.replace(i, bigDataFrame.vec(i).toCategoricalVec)
- bigDataFrame.update()
- ```
-
-11. Run deep learning to produce a model estimating arrival delay:
- ```scala
- import _root_.hex.deeplearning.DeepLearning
- import _root_.hex.deeplearning.DeepLearningModel.DeepLearningParameters
- import _root_.hex.deeplearning.DeepLearningModel.DeepLearningParameters.Activation
- val dlParams = new DeepLearningParameters()
- dlParams._train = bigDataFrame
- dlParams._response_column = 'ArrDelay
- dlParams._epochs = 5
- dlParams._activation = Activation.RectifierWithDropout
- dlParams._hidden = Array[Int](100, 100)
-
- // Create a job
- val dl = new DeepLearning(dlParams)
- val dlModel = dl.trainModel.get
- ```
-
-12. Use the model to estimate the delay on the training data:
- ```scala
- val predictionH2OFrame = dlModel.score(bigTable)('predict)
- val predictionsFromModel = asDataFrame(predictionH2OFrame)(sqlContext).collect.map(row => if (row.isNullAt(0)) Double.NaN else row(0))
- ```
-
-13. Generate an R-code producing residual plot:
- ```scala
- import org.apache.spark.examples.h2o.AirlinesWithWeatherDemo2.residualPlotRCode
- residualPlotRCode(predictionH2OFrame, 'predict, bigTable, 'ArrDelay, h2oContext)
- ```
-
-14. Execute generated R-code in RStudio:
- ```R
- #
- # R script for residual plot
- #
- # Import H2O library
- library(h2o)
- # Initialize H2O R-client
- h2o.init()
- # Fetch prediction and actual data, use remembered keys
- pred = h2o.getFrame("dframe_b5f449d0c04ee75fda1b9bc865b14a69")
- act = h2o.getFrame ("frame_rdd_14_b429e8b43d2d8c02899ccb61b72c4e57")
- # Select right columns
- predDelay = pred$predict
- actDelay = act$ArrDelay
- # Make sure that number of rows is same
- nrow(actDelay) == nrow(predDelay)
- # Compute residuals
- residuals = predDelay - actDelay
- # Plot residuals
- compare = cbind (as.data.frame(actDelay$ArrDelay), as.data.frame(residuals$predict))
- nrow(compare)
- plot( compare[,1:2] )
- ```
diff --git a/examples/README.rst b/examples/README.rst
new file mode 100644
index 000000000..2020b51c0
--- /dev/null
+++ b/examples/README.rst
@@ -0,0 +1,235 @@
+Sparkling Water Examples
+========================
+
+Available Demos And Applications
+--------------------------------
+
++-----------------------------------+--------------------------------------------------------------------------+
+| Example | Description |
++===================================+==========================================================================+
+| |CraigslistJobTitlesStreamingApp| | **Stream** application - it predicts job category based on incoming job |
+| | description. |
++-----------------------------------+--------------------------------------------------------------------------+
+| |CraigslistJobTitlesApp| | Predict job category based on posted job description. |
++-----------------------------------+--------------------------------------------------------------------------+
+| |ChicagoCrimeAppSmall| | Builds a model predicting a probability of arrest for given crime in |
+| | Chicago using data in `smalldata directory `__. |
++-----------------------------------+--------------------------------------------------------------------------+
+| |ChicagoCrimeApp| | Implementation of Chicago Crime demo with setup for data stored on HDFS. |
++-----------------------------------+--------------------------------------------------------------------------+
+| |CitiBikeSharingDemo| | Predicts occupancy of Citi bike stations in NYC. |
++-----------------------------------+--------------------------------------------------------------------------+
+| |HamOrSpamDemo| | Shows Spam detector with Spark and H2O's DeepLearning. |
++-----------------------------------+--------------------------------------------------------------------------+
+| |ProstateDemo| | Running K-means on `prostate dataset `__. |
++-----------------------------------+--------------------------------------------------------------------------+
+| |DeepLearningDemo| | Running DeepLearning on a subset of |
+| | `airlines dataset `__. |
++-----------------------------------+--------------------------------------------------------------------------+
+| |AirlinesWithWeatherDemo| | Joining flights data with weather data and running Deep Learning. |
++-----------------------------------+--------------------------------------------------------------------------+
+| |AirlinesWithWeatherDemo2| | New iteration of ``AirlinesWithWeatherDemo``. |
++-----------------------------------+--------------------------------------------------------------------------+
+
+ Run examples by typing ``bin/run-example.sh `` or follow the text below.
+
+Available Demos for Sparkling Shell
+-----------------------------------
+
++-----------------------------------+--------------------------------------------------------------------------+
+| Example | Description |
++===================================+==========================================================================+
+| |chicagoCrimeSmallShellScript| | Demo showing full source code of predicting arrest probability for a |
+| | given crime. It covers whole machine learning process from loading and |
+| | transforming data, building models, scoring incoming events. |
++-----------------------------------+--------------------------------------------------------------------------+
+| |chicagoCrimeSmallScript| | Example of using |ChicagoCrimeApp|. Creating application and using it |
+| | for scoring individual crime events. |
++-----------------------------------+--------------------------------------------------------------------------+
+| |hamOrSpamScript| | HamOrSpam application which detects Spam messages. Presented at |
+| | MLConf 2015 NYC. |
++-----------------------------------+--------------------------------------------------------------------------+
+| |strata2015Script| | NYC CitiBike demo presented at Strata 2015 in San Jose. |
++-----------------------------------+--------------------------------------------------------------------------+
+| |StrataAirlinesScript| | Example of using flights and weather data to predict delay of a flight. |
++-----------------------------------+--------------------------------------------------------------------------+
+
+ Run examples by typing ``bin/sparkling-shell -i ``
+
+--------------
+
+Building and Running Examples
+-----------------------------
+
+Please see `Running Sparkling Water Examples <../doc/devel/running_examples.rst>`__ for more information about how to
+build and run examples.
+
+Configuring Sparkling Water Variables
+-------------------------------------
+
+Please see `Available Sparkling Water Configuration Properties <../doc/configuration/configuration_properties.rst>`__ for
+more information about possible Sparkling Water configurations.
+
+Step-by-Step Weather Data Example
+---------------------------------
+
+1. Run Sparkling shell with an embedded cluster:
+
+.. code:: bash
+
+ export SPARK_HOME="/path/to/spark/installation"
+ export MASTER="local[*]"
+ bin/sparkling-shell
+
+2. To see the Sparkling shell (i.e., Spark driver) status, go to http://localhost:4040/.
+
+3. Initialize H2O services on top of Spark cluster:
+
+.. code:: scala
+
+ import org.apache.spark.h2o._
+ val h2oContext = H2OContext.getOrCreate(spark)
+ import h2oContext._
+ import h2oContext.implicits._
+
+4. Load weather data for Chicago international airport (ORD), with the
+   help of the RDD API:
+
+.. code:: scala
+
+ import org.apache.spark.examples.h2o._
+ val weatherDataFile = "examples/smalldata/Chicago_Ohare_International_Airport.csv"
+ val wrawdata = spark.sparkContext.textFile(weatherDataFile,3).cache()
+ val weatherTable = wrawdata.map(_.split(",")).map(row => WeatherParse(row)).filter(!_.isWrongRow())
+
+5. Load airlines data using the H2O parser:
+
+.. code:: scala
+
+ import java.io.File
+ val dataFile = "examples/smalldata/allyears2k_headers.csv.gz"
+ val airlinesData = new H2OFrame(new File(dataFile))
+
+6. Select flights destined for Chicago (ORD):
+
+.. code:: scala
+
+ val airlinesTable : RDD[Airlines] = asRDD[Airlines](airlinesData)
+ val flightsToORD = airlinesTable.filter(f => f.Dest==Some("ORD"))
+
+7. Compute the number of these flights:
+
+.. code:: scala
+
+ flightsToORD.count
+
+8. Use Spark SQL to join the flight data with the weather data:
+
+.. code:: scala
+
+ implicit val sqlContext = spark.sqlContext
+ import sqlContext.implicits._
+ flightsToORD.toDF.createOrReplaceTempView("FlightsToORD")
+ weatherTable.toDF.createOrReplaceTempView("WeatherORD")
+
+9. Perform SQL JOIN on both tables:
+
+.. code:: scala
+
+ val bigTable = sqlContext.sql(
+ """SELECT
+ |f.Year,f.Month,f.DayofMonth,
+ |f.CRSDepTime,f.CRSArrTime,f.CRSElapsedTime,
+ |f.UniqueCarrier,f.FlightNum,f.TailNum,
+ |f.Origin,f.Distance,
+ |w.TmaxF,w.TminF,w.TmeanF,w.PrcpIn,w.SnowIn,w.CDD,w.HDD,w.GDD,
+ |f.ArrDelay
+ |FROM FlightsToORD f
+ |JOIN WeatherORD w
+ |ON f.Year=w.Year AND f.Month=w.Month AND f.DayofMonth=w.Day""".stripMargin)
+
+10. Transform the first 3 columns containing date information into enum columns:
+
+.. code:: scala
+
+ val bigDataFrame: H2OFrame = h2oContext.asH2OFrame(bigTable)
+ for( i <- 0 to 2) bigDataFrame.replace(i, bigDataFrame.vec(i).toCategoricalVec)
+ bigDataFrame.update()
+
+11. Run deep learning to produce a model estimating arrival delay:
+
+.. code:: scala
+
+ import _root_.hex.deeplearning.DeepLearning
+ import _root_.hex.deeplearning.DeepLearningModel.DeepLearningParameters
+ import _root_.hex.deeplearning.DeepLearningModel.DeepLearningParameters.Activation
+ val dlParams = new DeepLearningParameters()
+ dlParams._train = bigDataFrame
+ dlParams._response_column = "ArrDelay"
+ dlParams._epochs = 5
+ dlParams._activation = Activation.RectifierWithDropout
+ dlParams._hidden = Array[Int](100, 100)
+
+ // Create a job
+ val dl = new DeepLearning(dlParams)
+ val dlModel = dl.trainModel.get
+
+
+12. Use the model to estimate the delay on the training data:
+
+.. code:: scala
+
+ val predictionH2OFrame = dlModel.score(bigTable)("predict")
+ val predictionsFromModel = asDataFrame(predictionH2OFrame)(sqlContext).collect.map{
+ row => if (row.isNullAt(0)) Double.NaN else row(0)
+ }
+
+13. Generate R code producing a residual plot:
+
+.. code:: scala
+
+ import org.apache.spark.examples.h2o.AirlinesWithWeatherDemo2.residualPlotRCode
+ residualPlotRCode(predictionH2OFrame, "predict", bigTable, "ArrDelay", h2oContext)
+
+14. Execute the generated R code in RStudio:
+
+.. code:: R
+
+ #
+ # R script for residual plot
+ #
+ # Import H2O library
+ library(h2o)
+ # Initialize H2O R-client
+ h2o.init()
+ # Fetch prediction and actual data, use remembered keys
+ pred = h2o.getFrame("dframe_b5f449d0c04ee75fda1b9bc865b14a69")
+ act = h2o.getFrame ("frame_rdd_14_b429e8b43d2d8c02899ccb61b72c4e57")
+ # Select right columns
+ predDelay = pred$predict
+ actDelay = act$ArrDelay
+ # Make sure that number of rows is same
+ nrow(actDelay) == nrow(predDelay)
+ # Compute residuals
+ residuals = predDelay - actDelay
+ # Plot residuals
+ compare = cbind (as.data.frame(actDelay$ArrDelay), as.data.frame(residuals$predict))
+ nrow(compare)
+ plot( compare[,1:2] )
+
+
+.. Links to the examples
+
+.. |CraigslistJobTitlesStreamingApp| replace:: `CraigslistJobTitlesStreamingApp `__
+.. |CraigslistJobTitlesApp| replace:: `CraigslistJobTitlesApp `__
+.. |ChicagoCrimeAppSmall| replace:: `ChicagoCrimeAppSmall `__
+.. |ChicagoCrimeApp| replace:: `ChicagoCrimeApp `__
+.. |CitiBikeSharingDemo| replace:: `CitiBikeSharingDemo `__
+.. |HamOrSpamDemo| replace:: `HamOrSpamDemo `__
+.. |ProstateDemo| replace:: `ProstateDemo `__
+.. |DeepLearningDemo| replace:: `DeepLearningDemo `__
+.. |AirlinesWithWeatherDemo| replace:: `AirlinesWithWeatherDemo `__
+.. |AirlinesWithWeatherDemo2| replace:: `AirlinesWithWeatherDemo2 `__
+.. |chicagoCrimeSmallShellScript| replace:: `chicagoCrimeSmallShell.script.scala `__
+.. |chicagoCrimeSmallScript| replace:: `chicagoCrimeSmall.script.scala `__
+.. |hamOrSpamScript| replace:: `hamOrSpam.script.scala `__
+.. |strata2015Script| replace:: `strata2015.script.scala `__
+.. |StrataAirlinesScript| replace:: `StrataAirlines.script.scala `__
diff --git a/make-dist.sh b/make-dist.sh
index 4418f4e1a..2e5365071 100755
--- a/make-dist.sh
+++ b/make-dist.sh
@@ -23,7 +23,7 @@ cat > "$TOPDIR/demofiles.list" <