From fbab7d7492a32d4b4df8b585987c37b6cc644589 Mon Sep 17 00:00:00 2001 From: angela0xdata Date: Fri, 23 Jun 2017 07:41:19 -0700 Subject: [PATCH 1/2] PUBDEV-4213: Updates to markdown files - merge to master Updated the syntax for markdown files. --- h2o-docs/StyleGuide.md | 16 +-- h2o-docs/src/product/flow.rst | 2 +- h2o-docs/src/product/flow/README.md | 99 +++++++------- h2o-docs/src/product/flow/SiteIntro.md | 36 ++--- .../howto/Connecting_RStudio_to_Sparkling_Water.md | 2 +- h2o-docs/src/product/howto/DemosAndTests.md | 22 ++-- h2o-docs/src/product/howto/FAQ.md | 36 ++--- h2o-docs/src/product/howto/H2O-DevCmdLine.md | 10 +- h2o-docs/src/product/howto/H2O-DevDocker.md | 4 +- h2o-docs/src/product/howto/H2O-DevHadoop.md | 10 +- h2o-docs/src/product/howto/H2O-DevLogs.md | 41 +++--- h2o-docs/src/product/howto/H2O-DevS3Creds.md | 57 ++++---- h2o-docs/src/product/howto/LDAP.md | 10 +- h2o-docs/src/product/howto/MOJO_QuickStart.md | 146 ++++++++++----------- h2o-docs/src/product/howto/Videos.md | 12 +- h2o-docs/src/product/howto/YARN_BP.md | 18 +-- h2o-docs/src/product/tutorials/GainsLift.md | 20 +-- h2o-docs/src/product/tutorials/GridSearch.md | 33 ++--- h2o-docs/src/product/tutorials/Interactions.md | 4 +- .../tutorials/datascience/DataScienceH2O-Dev.md | 117 +++++++++-------- h2o-docs/src/product/tutorials/dl/dl.md | 56 ++++---- h2o-docs/src/product/tutorials/gbm/gbm.md | 46 +++---- h2o-docs/src/product/tutorials/glm/glm.md | 44 +++---- h2o-docs/src/product/tutorials/glossary.md | 2 +- h2o-docs/src/product/tutorials/kmeans/kmeans.md | 40 +++--- h2o-docs/src/product/tutorials/pca/pca.md | 36 ++--- h2o-docs/src/product/tutorials/rf/rf.md | 47 +++---- h2o-docs/src/product/upgrade/H2OBenefits.md | 43 +++--- .../src/product/upgrade/H2ODevPortingRScripts.md | 74 +++++------ h2o-docs/src/product/upgrade/JavaChanges.md | 10 +- h2o-docs/src/product/upgrade/Migration.md | 121 ++++++++--------- h2o-docs/src/product/upgrade/PressRelease.md | 2 +- 
h2o-docs/src/product/upgrade/PythonParity.md | 4 +- h2o-docs/src/product/upgrade/RChanges.md | 12 +- h2o-docs/src/product/upgrade/Rdoc.md | 94 ++++++------- h2o-docs/src/product/upgrade/Upgrade.md | 27 ++-- 36 files changed, 699 insertions(+), 654 deletions(-) diff --git a/h2o-docs/StyleGuide.md b/h2o-docs/StyleGuide.md index 7091fe12e23..e8edce60738 100644 --- a/h2o-docs/StyleGuide.md +++ b/h2o-docs/StyleGuide.md @@ -1,9 +1,9 @@ -#Style Guide +# Style Guide -##Capitalization +## Capitalization -###Mixed +### Mixed (always capitalize as below) @@ -20,7 +20,7 @@ - PyPI - RUnit -###Initial Caps +### Initial Caps (always capitalize as below) @@ -45,7 +45,7 @@ - Tweedie -###Always All Caps +### Always All Caps For any acronym, spell out in 1st use, then provide acronym in parentheses after; for pluralization, add a lowercase "s" (POJO -> POJOs, API -> APIs) @@ -82,7 +82,7 @@ For any acronym, spell out in 1st use, then provide acronym in parentheses after - RDD (Resilient Distributed Dataset) - YARN (Yet Another Resource Negotiator) -###Always Lowercase +### Always Lowercase (unless starting a sentence or in heading) @@ -94,7 +94,7 @@ For any acronym, spell out in 1st use, then provide acronym in parentheses after - tahn -##Algos +## Algos - Naïve Bayes LATIN SMALL LETTER I WITH DIAERESIS @@ -104,6 +104,6 @@ Unicode: U+00EF, UTF-8: C3 AF - K-means -#One word (not two) +# One word (not two) - Dataset \ No newline at end of file diff --git a/h2o-docs/src/product/flow.rst b/h2o-docs/src/product/flow.rst index 10c1f8f21ab..8b0a42b58fa 100644 --- a/h2o-docs/src/product/flow.rst +++ b/h2o-docs/src/product/flow.rst @@ -1420,7 +1420,7 @@ You can also view predictions by clicking the drop-down **Score** menu and selecting **List All Predictions**. -Intepreting the Gains/Lift Chart +Interpreting the Gains/Lift Chart ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The Gains/Lift chart evaluates the prediction ability of a binary classification model. 
The chart is computed using the prediction probability and the true response (class) labels. The accuracy of the classification model for a random sample is evaluated according to the results when the model is and is not used. diff --git a/h2o-docs/src/product/flow/README.md b/h2o-docs/src/product/flow/README.md index a84a51202e7..a6a5f3c6e6f 100644 --- a/h2o-docs/src/product/flow/README.md +++ b/h2o-docs/src/product/flow/README.md @@ -1,4 +1,6 @@ -#Flow Web UI ... +# Flow Web UI ... + +>**Note**: This topic is no longer being maintained. Refer to the [Using Flow - H2O's Web UI](https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/flow.rst) topic for the most up-to-date documentation. H2O Flow is an open-source user interface for H2O. It is a web-based interactive environment that allows you to combine code execution, text, mathematics, plots, and rich media in a single document. @@ -10,7 +12,7 @@ H2O Flow sends commands to H2O as a sequence of executable cells. The cells can While H2O Flow supports REST API, R scripts, and CoffeeScript, no programming experience is required to run H2O Flow. You can click your way through any H2O operation without ever writing a single line of code. You can even disable the input cells to run H2O Flow using only the GUI. H2O Flow is designed to guide you every step of the way, by providing input prompts, interactive help, and example flows. -##Introduction +## Introduction This guide will walk you through how to use H2O's web UI, H2O Flow. To view a demo video of H2O Flow, click here. @@ -72,14 +74,14 @@ Before getting started with H2O Flow, make sure you understand the different cel There are two modes for cells: edit and command. -###Using Edit Mode +### Using Edit Mode In edit mode, the cell is yellow with a blinking bar to indicate where text can be entered and there is an orange flag to the left of the cell. 
![Edit Mode](images/Flow_EditMode.png) -###Using Command Mode +### Using Command Mode In command mode, the flag is yellow. The flag also indicates the cell's format: - **MD**: Markdown @@ -113,7 +115,7 @@ In edit mode, the cell is yellow with a blinking bar to indicate where text can ![Cell executing](images/Flow_cellmode_runningflag.png) -###Changing Cell Formats +### Changing Cell Formats To change the cell's format (for example, from code to Markdown), make sure you are in command (not edit) mode and that the cell you want to change is selected. The easiest way to do this is to click on the flag to the left of the cell. Enter the keyboard shortcut for the format you want to use. The flag's text changes to display the current format. @@ -130,7 +132,7 @@ Heading 5 | `5` Heading 6 | `6` -###Running Cells +### Running Cells The series of buttons at the top of the page below the menus run cells in a flow. @@ -149,7 +151,7 @@ The series of buttons at the top of the page below the menus run cells in a flow -###Running Flows +### Running Flows When you run the flow, a progress bar indicates the current status of the flow. You can cancel the currently running flow by clicking the **Stop** button in the progress bar. ![Flow Progress Bar](images/Flow_progressbar.png) @@ -162,7 +164,7 @@ When the flow is complete, a message displays in the upper right. >**Note**: If there is an error in the flow, H2O Flow stops at the cell that contains the error. -###Using Keyboard Shortcuts +### Using Keyboard Shortcuts Here are some important keyboard shortcuts to remember: @@ -178,20 +180,20 @@ The following commands must be entered in [command mode](#CmdMode). You can view these shortcuts by clicking **Help** > **Keyboard Shortcuts** or by clicking the **Help** tab in the sidebar. -###Using Variables in Cells +### Using Variables in Cells Variables can be used to store information such as download locations. To use a variable in Flow: -0. 
Define the variable in a code cell (for example, `locA = "https://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/kdd2009/small-churn/kdd_train.csv"`). +1. Define the variable in a code cell (for example, `locA = "https://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/kdd2009/small-churn/kdd_train.csv"`). ![Flow variable definition](images/Flow_VariableDefinition.png) -0. Run the cell. H2O validates the variable. +2. Run the cell. H2O validates the variable. ![Flow variable validation](images/Flow_VariableValidation.png) -0. Use the variable in another code cell (for example, `importFiles [locA]`). +3. Use the variable in another code cell (for example, `importFiles [locA]`). ![Flow variable example](images/Flow_VariableExample.png) To further simplify your workflow, you can save the cells containing the variables and definitions as [clips](#Clips). -###Using Flow Buttons +### Using Flow Buttons There are also a series of buttons at the top of the page below the flow name that allow you to save the current flow, add a new cell, move cells up or down, run the current cell, and cut, copy, or paste the current cell. If you hover over the button, a description of the button's function displays. ![Flow buttons](images/Flow_buttons.png) @@ -224,6 +226,7 @@ There are multiple ways to import data in H2O flow: After selecting the file to import, the file path displays in the "Search Results" section. To import a single file, click the plus sign next to the file. To import all files in the search results, click the **Add all** link. The files selected for import display in the "Selected Files" section. ![Import Files](images/Flow_import.png) + >**Note**: If the file is compressed, it will only be read using a single thread. For best performance, we recommend uncompressing the file before importing, as this will allow use of the faster multithreaded distributed parallel reader during import. 
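The note above recommends uncompressing files before import so the parser can use the multithreaded reader. As a rough illustration of doing that ahead of time — plain Python standard library, not an H2O API, and the file names are made up — a `.gz` file can be expanded like this:

```python
import gzip
import os
import shutil
import tempfile

def uncompress_gzip(src_path, dest_path):
    """Decompress a .gz file so a parser can later read it with multiple threads."""
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)

# Round-trip a small CSV payload to show the helper works.
tmpdir = tempfile.mkdtemp()
gz_path = os.path.join(tmpdir, "train.csv.gz")
csv_path = os.path.join(tmpdir, "train.csv")
payload = b"a,b\n1,2\n"
with gzip.open(gz_path, "wb") as f:
    f.write(payload)
uncompress_gzip(gz_path, csv_path)
```

The uncompressed `train.csv` can then be imported into H2O as usual.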
Please note that .zip files containing multiple files are not currently supported. @@ -238,7 +241,7 @@ After you click the **Import** button, the raw code for the current job displays ![Import Files - Results](images/Flow_import_results.png) -##Uploading Data +## Uploading Data To upload a local file, click the **Data** menu and select **Upload File...**. Click the **Choose File** button, select the file, click the **Choose** button, then click the **Upload** button. @@ -254,7 +257,7 @@ Ok, now that your data is available in H2O Flow, let's move on to the next step: --- -##Parsing Data +## Parsing Data After you have imported your data, parse the data. @@ -686,35 +689,35 @@ To generate a Plain Old Java Object (POJO) that can use the model outside of H2O --- -###Exporting and Importing Models +### Exporting and Importing Models **To export a built model:** -0. Click the **Model** menu at the top of the screen. -0. Select *Export Model...* -0. In the `exportModel` cell that appears, select the model from the drop-down *Model:* list. -0. Enter a location for the exported model in the *Path:* entry field. +1. Click the **Model** menu at the top of the screen. +2. Select *Export Model...* +3. In the `exportModel` cell that appears, select the model from the drop-down *Model:* list. +4. Enter a location for the exported model in the *Path:* entry field. >**Note**: If you specify a location that doesn't exist, it will be created. For example, if you only enter `test` in the *Path:* entry field, the model will be exported to `h2o-3/test`. -0. To overwrite any files with the same name, check the *Overwrite:* checkbox. -0. Click the **Export** button. A confirmation message displays when the model has been successfully exported. +5. To overwrite any files with the same name, check the *Overwrite:* checkbox. +6. Click the **Export** button. A confirmation message displays when the model has been successfully exported. 
![Export Model](images/ExportModel.png) **To import a built model:** -0. Click the **Model** menu at the top of the screen. -0. Select *Import Model...* -0. Enter the location of the model in the *Path:* entry field. +1. Click the **Model** menu at the top of the screen. +2. Select *Import Model...* +3. Enter the location of the model in the *Path:* entry field. >**Note**: The file path must be complete (e.g., `Users/h2o-user/h2o-3/exported_models`). Do not rename models while importing. -0. To overwrite any files with the same name, check the *Overwrite:* checkbox. -0. Click the **Import** button. A confirmation message displays when the model has been successfully imported. To view the imported model, click the **View Model** button. +4. To overwrite any files with the same name, check the *Overwrite:* checkbox. +5. Click the **Import** button. A confirmation message displays when the model has been successfully imported. To view the imported model, click the **View Model** button. ![Import Model](images/ImportModel.png) --- -###Using Grid Search +### Using Grid Search To include a parameter in a grid search in Flow, check the checkbox in the *Grid?* column to the right of the parameter name (highlighted in red in the image below). @@ -730,7 +733,7 @@ To include a parameter in a grid search in Flow, check the checkbox in the *Grid --- -###Checkpointing Models +### Checkpointing Models Some model types, such as DRF, GBM, and Deep Learning, support checkpointing. A checkpoint resumes model training so that you can iterate your model. The dataset must be the same. The following model parameters must be the same when restarting a model from a checkpoint: @@ -766,16 +769,16 @@ Can be modified | | | `elastic_averaging_moving_rate`| `elastic_averaging_regularization`| `mini_batch_size` -0. After building your model, copy the `model_id`. To view the `model_id`, click the **Model** menu then click **List All Models**. -0. 
Select the model type from the drop-down **Model** menu. +1. After building your model, copy the `model_id`. To view the `model_id`, click the **Model** menu then click **List All Models**. +2. Select the model type from the drop-down **Model** menu. >**Note**: The model type must be the same as the checkpointed model. -0. Paste the copied `model_id` in the *checkpoint* entry field. -0. Click the **Build Model** button. The model will resume training. +3. Paste the copied `model_id` in the *checkpoint* entry field. +4. Click the **Build Model** button. The model will resume training. --- -###Interpreting Model Results +### Interpreting Model Results **Scoring history**: [GBM](#GBM), [DL](#DL) Represents the error rate of the model as it is built. Typically, the error rate will be higher at the beginning (the left side of the graph) then decrease as the model building completes and accuracy improves. Can include mean squared error (MSE) and deviance. @@ -898,16 +901,16 @@ Datasets can be split within Flow for use in model training and testing. ![splitFrame cell](images/Flow_splitFrame.png) -0. To split a frame, click the **Assist Me** button, then click **splitFrame**. +1. To split a frame, click the **Assist Me** button, then click **splitFrame**. >**Note**: You can also click the drop-down **Data** menu and select **Split Frame...**. -0. From the drop-down **Frame:** list, select the frame to split. -0. In the second **Ratio** entry field, specify the fractional value to determine the split. The first **Ratio** field is automatically calculated based on the values entered in the second **Ratio** field. +2. From the drop-down **Frame:** list, select the frame to split. +3. In the second **Ratio** entry field, specify the fractional value to determine the split. The first **Ratio** field is automatically calculated based on the values entered in the second **Ratio** field. 
>**Note**: Only fractional values between 0 and 1 are supported (for example, enter `.5` to split the frame in half). The total sum of the ratio values must equal one. H2O automatically adjusts the ratio values to equal one; if unsupported values are entered, an error displays. -0. In the **Key** entry field, specify a name for the new frame. -0. (Optional) To add another split, click the **Add a split** link. To remove a split, click the `X` to the right of the **Key** entry field. -0. Click the **Create** button. +4. In the **Key** entry field, specify a name for the new frame. +5. (Optional) To add another split, click the **Add a split** link. To remove a split, click the `X` to the right of the **Key** entry field. +6. Click the **Create** button. --- ### Creating Frames @@ -1131,7 +1134,7 @@ To view the stack trace information for a specific node, select it from the drop --- -##Viewing Network Test Results +## Viewing Network Test Results To view network test results, click the **Admin** menu, then click **Network Test**. @@ -1169,18 +1172,18 @@ To obtain the most recent information, click the **Refresh** button. --- -##Reporting Issues +## Reporting Issues If you experience an error with Flow, you can submit a JIRA ticket to notify our team. -0. First, click the **Admin** menu, then click **Download Logs**. This will download a file contains information that will help our developers identify the cause of the issue. -0. Click the **Help** menu, then click **Report an issue**. This will open our JIRA page where you can file your ticket. -0. Click the **Create** button at the top of the JIRA page. -0. Attach the log file from the first step, write a description of the error you experienced, then click the **Create** button at the bottom of the page. Our team will work to resolve the issue and you can track the progress of your ticket in JIRA. +1. First, click the **Admin** menu, then click **Download Logs**. 
This will download a file containing information that will help our developers identify the cause of the issue. +2. Click the **Help** menu, then click **Report an issue**. This will open our JIRA page where you can file your ticket. +3. Click the **Create** button at the top of the JIRA page. +4. Attach the log file from the first step, write a description of the error you experienced, then click the **Create** button at the bottom of the page. Our team will work to resolve the issue, and you can track the progress of your ticket in JIRA. --- -##Requesting Help +## Requesting Help If you have a Google account, you can submit a request for assistance with H2O on our Google Groups page, [H2Ostream](https://groups.google.com/forum/#!forum/h2ostream). @@ -1194,6 +1197,8 @@ To access H2Ostream from Flow: You can also email your question to [h2ostream@googlegroups.com](mailto:h2ostream@googlegroups.com). +Or, you can post your question on [Stack Overflow](https://stackoverflow.com/questions/tagged/h2o) using the "h2o" tag. + --- diff --git a/h2o-docs/src/product/flow/SiteIntro.md b/h2o-docs/src/product/flow/SiteIntro.md index 7f4be5588a6..874ad4e0d1b 100644 --- a/h2o-docs/src/product/flow/SiteIntro.md +++ b/h2o-docs/src/product/flow/SiteIntro.md @@ -1,4 +1,6 @@ -#Welcome to H2O 3.0 +# Welcome to H2O 3.0 + +>**Note**: This topic is no longer being maintained. Refer to the [Welcome](https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/welcome.rst) topic for the most up-to-date documentation. Welcome to the H2O documentation site! Depending on your area of interest, select a learning path from the links above.
@@ -18,7 +20,7 @@ Type your question in the entry field that appears at the bottom of the sidebar --- -##New Users +## New Users If you're just getting started with H2O, here are some links to help you learn more: @@ -52,7 +54,7 @@ If you're just getting started with H2O, here are some links to help you learn m --- -##Experienced Users +## Experienced Users If you've used previous versions of H2O, the following links will help guide you through the process of upgrading to H2O 3.0. @@ -74,7 +76,7 @@ If you've used previous versions of H2O, the following links will help guide you --- -##Enterprise Users +## Enterprise Users If you're considering using H2O in an enterprise environment, you'll be happy to know that the H2O platform is supported on all major Hadoop distributions including Cloudera Enterprise, Hortonworks Data Platform and the MapR Apache Hadoop Distribution. @@ -99,7 +101,7 @@ For additional sales or marketing assistance, please email [sales@h2o.ai](mailto --- -##Sparkling Water Users +## Sparkling Water Users Sparkling Water is a gradle project with the following submodules: @@ -121,7 +123,7 @@ Sparkling Water is versioned according to the Spark versioning, so make sure to - Use [Sparkling Water 1.5](http://h2o-release.s3.amazonaws.com/sparkling-water/rel-1.5/3/index.html) for Spark 1.5 -###Getting Started with Sparkling Water +### Getting Started with Sparkling Water - Download Sparkling Water: Go here to download Sparkling Water. @@ -143,7 +145,7 @@ Sparkling Water is versioned according to the Spark versioning, so make sure to - Connecting RStudio to Sparkling Water: This illustrated tutorial describes how to use RStudio to connect to Sparkling Water. 
-###Sparkling Water Blog Posts +### Sparkling Water Blog Posts - How Sparkling Water Brings H2O to Spark @@ -151,7 +153,7 @@ Sparkling Water is versioned according to the Spark versioning, so make sure to - In-memory Big Data: Spark + H2O -###Sparkling Water Meetup Slide Decks +### Sparkling Water Meetup Slide Decks - Sparkling Water Meetup 02/03/2015 @@ -162,7 +164,7 @@ Sparkling Water is versioned according to the Spark versioning, so make sure to - Sparkling Water Hands-On -###PySparkling +### PySparkling >*Note*: PySparkling requires [Sparkling Water 1.5](http://h2o-release.s3.amazonaws.com/sparkling-water/rel-1.5/3/index.html) or later. @@ -178,7 +180,7 @@ To install H2O's PySparkling package, use the egg file included in the distribut --- -##Python Users +## Python Users Pythonistas will be glad to know that H2O now provides support for this popular programming language. Python users can also use H2O with IPython notebooks. For more information, refer to the following links. @@ -198,7 +200,7 @@ Pythonistas will be glad to know that H2O now provides support for this popular --- -##R Users +## R Users Don't worry, R users - we still provide R support in the latest version of H2O, just as before. The R components of H2O have been cleaned up, simplified, and standardized, so the command format is easier and more intuitive. Due to these improvements, be aware that any scripts created with previous versions of H2O will need some revision to be compatible with the latest version. @@ -217,7 +219,7 @@ To check which version of H2O is installed in R, use `versions::installed.versio - Connecting RStudio to Sparkling Water: This illustrated tutorial describes how to use RStudio to connect to Sparkling Water. -###Ensembles +### Ensembles Ensemble machine learning methods use multiple learning algorithms to obtain better predictive performance. 
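To make the ensemble idea above concrete, here is a toy majority-vote combiner — plain Python only, an illustration of the principle rather than H2O's ensemble API:

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the largest number of base learners."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical base learners classify the same row; two of the
# three agree, so the ensemble prediction follows the majority.
ensemble_label = majority_vote(["cat", "dog", "cat"])  # "cat"
```

Combining several learners this way often outperforms any single learner, which is the motivation behind H2O's ensemble methods.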
@@ -228,7 +230,7 @@ Ensemble machine learning methods use multiple learning algorithms to obtain bet --- -##API Users +## API Users API users will be happy to know that the APIs have been more thoroughly documented in the latest release of H2O and additional capabilities (such as exporting weights and biases for Deep Learning models) have been added. @@ -246,7 +248,7 @@ REST APIs are generated immediately out of the code, allowing users to implement --- -##Java Users +## Java Users For Java developers, the following resources will help you create your own custom app that uses H2O. @@ -256,7 +258,7 @@ For Java developers, the following resources will help you create your own custo - h2o-genmodel (POJO) Javadoc: Provides a step-by-step guide to creating and implementing POJOs in a Java application. -###SDK Information +### SDK Information The Java API is generated and accessible from the [download page](http://h2o.ai/download). @@ -267,7 +269,7 @@ The Java API is generated and accessible from the [download page](http://h2o.ai/ --- -##Developers +## Developers If you're looking to use H2O to help you develop your own apps, the following links will provide helpful references. @@ -309,7 +311,7 @@ After starting multiple "worker" node processes in addition to the JUnit test pr - Contributing code: If you're interested in contributing code to H2O, we appreciate your assistance! This document describes how to access our list of Jiras that contributors can work on and how to contact us. **Note**: To access this link, you must have an [Atlassian account](https://id.atlassian.com/signup?application=mac&tenant=&continue=https%3A%2F%2Fmy.atlassian.com). 
--- -#Downloading H2O +# Downloading H2O * [Download page for this build](http://h2o-release.s3.amazonaws.com/h2o/{{branch_name}}/{{build_number}}/index.html) * [h2o.ai main download page](http://www.h2o.ai/download) diff --git a/h2o-docs/src/product/howto/Connecting_RStudio_to_Sparkling_Water.md b/h2o-docs/src/product/howto/Connecting_RStudio_to_Sparkling_Water.md index 71abd358ad2..82bf32e96ca 100644 --- a/h2o-docs/src/product/howto/Connecting_RStudio_to_Sparkling_Water.md +++ b/h2o-docs/src/product/howto/Connecting_RStudio_to_Sparkling_Water.md @@ -1,4 +1,4 @@ -#Connecting R Studio to Sparkling Water +# Connecting R Studio to Sparkling Water If you have connected to H2O from [RStudio](http://h2o-release.s3.amazonaws.com/h2o/{{branch_name}}/{{build_number}}/index.html#R) before, the process for connecting to Sparkling Water from RStudio is very similar. diff --git a/h2o-docs/src/product/howto/DemosAndTests.md b/h2o-docs/src/product/howto/DemosAndTests.md index da2e7d7cc59..e59cc7608f2 100644 --- a/h2o-docs/src/product/howto/DemosAndTests.md +++ b/h2o-docs/src/product/howto/DemosAndTests.md @@ -1,4 +1,4 @@ -#Running Demos and Tests +# Running Demos and Tests H2O provides demos and tests in R, Python, Flow, Scala and Java for our algorithms. @@ -6,35 +6,35 @@ Demos contain example workflows showing how typical end users make use of H2O's Tests exercise every capability of H2O in detail using appropriate datasets and parameters and automatically verify that the expected results are produced. -##Demos +## Demos -###R +### R - [Kaggle](https://github.com/h2oai/h2o-3/tree/master/h2o-r/demos/kaggle): Contains Kaggle demos, including "Beating the Benchmark" and "Will It Rain?" - [Supervised Demo](https://github.com/h2oai/h2o-3/blob/master/h2o-r/demos/large/supervised.R): Runs four algorithms on categorical or continuous response datasets and reports performance. 
-###Python +### Python - [Python Demos](https://github.com/h2oai/h2o-3/tree/master/h2o-py/demos): Contains a library of Python demos and instructions on how to run the demos. -###Flow +### Flow - [Flow Demos](https://github.com/h2oai/h2o-3/tree/master/h2o-docs/src/product/flow/packs/examples): Contains a library of demos that can be run in H2O's web UI, Flow. These demos can also be accessed within Flow by clicking the "Help" sidebar, then clicking "Browse installed packs...", then clicking the "Examples" folder and selecting the demo flow. -###Scala +### Scala - [Scala Demos](https://github.com/h2oai/sparkling-water/tree/master/examples/scripts): Contains Scala demos used at meetups to demonstrate Sparkling Water. -###Java +### Java >Need location -##Tests +## Tests -###R +### R - [Instructions](https://github.com/h2oai/h2o-3/tree/master/h2o-r): Instructions on running R tests. @@ -66,7 +66,7 @@ Tests exercise every capability of H2O in detail using appropriate datasets and -###Python +### Python - [Instructions](https://github.com/h2oai/h2o-3/tree/master/h2o-py): Instructions for running Python tests. @@ -80,7 +80,7 @@ Tests exercise every capability of H2O in detail using appropriate datasets and - [DRF](https://github.com/h2oai/h2o-3/tree/master/h2o-py/tests/testdir_algos/rf): Library of DRF Python tests. -###Java +### Java - [Instructions](https://github.com/h2oai/h2o-3/blob/master/h2o-core/testMultiNode.sh): Instructions on running Java tests. diff --git a/h2o-docs/src/product/howto/FAQ.md b/h2o-docs/src/product/howto/FAQ.md index 886bbf7cf03..4a12e41cff9 100644 --- a/h2o-docs/src/product/howto/FAQ.md +++ b/h2o-docs/src/product/howto/FAQ.md @@ -1,6 +1,8 @@ -#FAQ +# FAQ -##General Troubleshooting Tips +>**Note**: This topic is no longer being maintained. Refer to individual topics within the [FAQ](https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/faq) folder for the most up-to-date version of the H2O FAQ. 
+ +## General Troubleshooting Tips - Confirm your internet connection is active. @@ -64,7 +66,7 @@ This error output indicates that your Java version is not supported. Upgrade to --- -##Algorithms +## Algorithms **What does it mean if the r2 value in my model is negative?** @@ -187,10 +189,10 @@ https://github.com/h2oai/sparkling-water/blob/master/examples/scripts/craigslist Here is an example of how the prediction process works in H2O: -0. Train a model using data that has a categorical predictor column with levels B,C, and D (no other levels); this level will be the "training set domain": {B,C,D} -0. During scoring, the test set has only rows with levels A,C, and E for that column; this is the "test set domain": {A,C,E} -0. For scoring, a combined "scoring domain" is created, which is the training domain appended with the extra test set domain entries: {B,C,D,A,E} -0. Each model can handle these extra levels {A,E} separately during scoring. +1. Train a model using data that has a categorical predictor column with levels B,C, and D (no other levels); this level will be the "training set domain": {B,C,D} +2. During scoring, the test set has only rows with levels A,C, and E for that column; this is the "test set domain": {A,C,E} +3. For scoring, a combined "scoring domain" is created, which is the training domain appended with the extra test set domain entries: {B,C,D,A,E} +4. Each model can handle these extra levels {A,E} separately during scoring. The behavior for unseen categorical levels depends on the algorithm and how it handles missing levels (NA values): @@ -228,7 +230,7 @@ To convert the response column: --- -##Building H2O +## Building H2O **During the build process, the following error message displays. 
What do I need to do to resolve it?** @@ -261,7 +263,7 @@ Try using `./gradlew build -x test` - the build may be failing tests if data is --- -##Clusters +## Clusters **When trying to launch H2O, I received the following error message: `ERROR: Too many retries starting cloud.` What should I do?** @@ -375,7 +377,7 @@ The following information displays for each message: --- -##Data +## Data **How should I format my SVMLight data before importing?** @@ -424,7 +426,7 @@ Parsing Gzip files is not done in parallel, so it is sequential and uses only on --- -##General +## General **How do I score using an exported JSON model?** @@ -761,7 +763,7 @@ Do Nothing and All Is Well. --- -##Hadoop +## Hadoop **Why did I get an error in R when I tried to save my model to my home directory in Hadoop?** @@ -865,7 +867,7 @@ Then run the command to launch the H2O Application in the driver by specifying t --- -##Java +## Java **How do I use H2O with Java?** @@ -936,7 +938,7 @@ EOF --- -##Python +## Python **I tried to install H2O in Python but `pip install scikit-learn` failed - what should I do?** @@ -1165,7 +1167,7 @@ Yes, a notebook is available [here](https://github.com/h2oai/h2o-3/blob/master/h --- -##R +## R **Which versions of R are compatible with H2O?** @@ -1583,7 +1585,7 @@ new_fr --- -##Sparkling Water +## Sparkling Water **What are the advantages of using Sparkling Water compared with H2O?** @@ -1800,7 +1802,7 @@ After setting up `H2OContext`, try to run Sparkling Water again. --- -##Tunneling between servers with H2O +## Tunneling between servers with H2O To tunnel between servers (for example, due to firewalls): diff --git a/h2o-docs/src/product/howto/H2O-DevCmdLine.md b/h2o-docs/src/product/howto/H2O-DevCmdLine.md index 2607546a29f..9bebdf029c0 100644 --- a/h2o-docs/src/product/howto/H2O-DevCmdLine.md +++ b/h2o-docs/src/product/howto/H2O-DevCmdLine.md @@ -1,5 +1,7 @@ # ... From the Cmd Line +>**Note**: This topic is no longer being maintained. 
Refer to [Starting H2O from the Command Line](https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/starting-h2o.rst#from-the-command-line) for the most up-to-date documentation. + You can use Terminal (OS X) or the Command Prompt (Windows) to launch H2O 3.0. When you launch from the command line, you can include additional instructions to H2O 3.0, such as how many nodes to launch, how much memory to allocate for each node, names to assign to the nodes in the cloud, and more. >**Note**: H2O requires some space in the `/tmp` directory to launch. If you cannot launch H2O, try freeing up some space in the `/tmp` directory, then try launching H2O again. @@ -13,7 +15,7 @@ There are two different argument types: The arguments use the following format: java `<JVM Options>` -jar h2o.jar `<H2O Options>`. -##JVM Options +## JVM Options - `-version`: Display Java version info. - `-Xmx`: To set the total heap size for an H2O node, configure the memory allocation option `-Xmx`. By default, this option is set to 1 Gb (`-Xmx1g`). When launching nodes, we recommend allocating a total of four times the memory of your data. > **Note**: Do not try to launch H2O with more memory than you have available. -##H2O Options +## H2O Options - `-h` or `-help`: Display this information in the command line output. - `-name <H2OCloudName>`: Assign a name to the H2O instance in the cloud (where `<H2OCloudName>` is the name of the cloud). Nodes with the same cloud name will form an H2O cloud (also known as an H2O cluster). @@ -40,7 +42,7 @@ The arguments use the following format: java `<JVM Options>` -jar h2o.jar `<H2O Options>`. >**Note**: This topic is no longer being maintained. Refer to [Docker Users](https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/welcome.rst#docker-users) for the most up-to-date documentation.
+ This walkthrough describes: * Installing Docker on Mac or Linux OS @@ -24,7 +26,7 @@ This walkthrough describes: * Using `User` directory (not `root`) -##Notes +## Notes - Older Linux kernel versions are known to cause kernel panics that break Docker; there are ways around it, but these should be attempted at your own risk. To check the version of your kernel, run `uname -r` at the command prompt. The following walkthrough has been tested on a Mac OS X 10.10.1. - The Dockerfile always pulls the latest H2O release. diff --git a/h2o-docs/src/product/howto/H2O-DevHadoop.md b/h2o-docs/src/product/howto/H2O-DevHadoop.md index 850a556a627..184570cd1ed 100644 --- a/h2o-docs/src/product/howto/H2O-DevHadoop.md +++ b/h2o-docs/src/product/howto/H2O-DevHadoop.md @@ -1,5 +1,7 @@ # ... On Hadoop +>**Note**: This topic is no longer being maintained. Refer to [Hadoop Users](https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/welcome.rst#hadoop-users) for the most up-to-date documentation. + Currently supported versions: - CDH 5.2 @@ -54,14 +56,14 @@ Tutorial The following tutorial will walk the user through the download or build of H2O and the parameters involved in launching H2O from the command line. -0. Download the latest H2O release for your version of Hadoop. Refer to the H2O on Hadoop Download page. +1. Download the latest H2O release for your version of Hadoop. Refer to the H2O on Hadoop Download page. -0. Prepare the job input on the Hadoop Node by unzipping the build file and changing to the directory with the Hadoop and H2O's driver jar files. +2. Prepare the job input on the Hadoop Node by unzipping the build file and changing to the directory with the Hadoop and H2O's driver jar files. unzip h2o-{{project_version}}-*.zip cd h2o-{{project_version}}-* -0. To launch H2O nodes and form a cluster on the Hadoop cluster, run: +3. 
To launch H2O nodes and form a cluster on the Hadoop cluster, run:

    hadoop jar h2odriver.jar -nodes 1 -mapperXmx 6g -output hdfsOutputDirName

@@ -73,7 +75,7 @@ The following tutorial will walk the user through the download or build of H2O a

- *output* is the name of the directory created each time an H2O cloud is created, so the name must be unique each time it is launched.

-0. To monitor your job, direct your web browser to your standard job tracker Web UI.
+4. To monitor your job, direct your web browser to your standard job tracker Web UI.

To access H2O's Web UI, direct your web browser to one of the launched instances. If you are unsure where your JVM is launched, review the output from your command after the nodes have clouded up and formed a cluster. Any of the nodes' IP addresses will work, as there is no master node.

diff --git a/h2o-docs/src/product/howto/H2O-DevLogs.md b/h2o-docs/src/product/howto/H2O-DevLogs.md
index cadbb2adcb9..e855fb19223 100644
--- a/h2o-docs/src/product/howto/H2O-DevLogs.md
+++ b/h2o-docs/src/product/howto/H2O-DevLogs.md
@@ -1,13 +1,14 @@
-#Downloading Logs
+# Downloading Logs
+>**Note**: This topic is no longer being maintained. Refer to [Downloading Logs](https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/logs.rst) for the most up-to-date documentation.

-##Accessing Logs
+## Accessing Logs

Depending on whether you are using Hadoop with H2O and whether the job is currently running, there are different ways of obtaining the logs for H2O.

Copy and email the logs to support@h2o.ai or submit them to h2ostream@googlegroups.com with a brief description of your Hadoop environment, including the Hadoop distribution and version.

-###Without Running Jobs
+### Without Running Jobs

- If you are using Hadoop and the job is not running, view the logs by using the `yarn logs -applicationId` command.
When you start an H2O instance, the complete command displays in the output: @@ -41,40 +42,40 @@ In the above example, the command is specified in the next to last line (`For YA Use YARN to obtain the `stdout` and `stderr` logs that are used for troubleshooting. To learn how to access YARN based on management software, version, and job status, see [Accessing YARN](#AccessYARN). -0. Click the **Applications** link to view all jobs, then click the **History** link for the job. +1. Click the **Applications** link to view all jobs, then click the **History** link for the job. ![YARN - History](images/YARN_AllApps_History.png) -0. Click the **logs** link. +2. Click the **logs** link. ![YARN - History](images/YARN_History_Logs.png) -0. Copy the information that displays and send it in an email to support@h2o.ai. +3. Copy the information that displays and send it in an email to support@h2o.ai. ![YARN - History](images/YARN_History_Logs2.png) --- -###With Running Jobs +### With Running Jobs If you are using Hadoop and the job is still running: - Use YARN to obtain the `stdout` and `stderr` logs that are used for troubleshooting. To learn how to access YARN based on management software, version, and job status, see [Accessing YARN](#AccessYARN). - 0. Click the **Applications** link to view all jobs, then click the **ApplicationMaster** link for the job. + 1. Click the **Applications** link to view all jobs, then click the **ApplicationMaster** link for the job. ![YARN - Application Master](images/YARN_AllApps_AppMaster.png) - 0. Select the job from the list of active jobs. + 2. Select the job from the list of active jobs. ![YARN - Application Master](images/YARN_AppMaster_Job.png) - 0. Click the **logs** link. + 3. Click the **logs** link. ![YARN - Application Master](images/YARN_AppMaster_Logs.png) - 0. Send the contents of the displayed files to support@h2o.ai. + 4. Send the contents of the displayed files to support@h2o.ai. 
![YARN - Application Master](images/YARN_AppMaster_Logs2.png) @@ -127,14 +128,14 @@ If you are using Hadoop and the job is still running: --- -##Accessing YARN +## Accessing YARN Methods for accessing YARN vary depending on the default management software and version, as well as job status. --- -###Cloudera 5 & 5.2 +### Cloudera 5 & 5.2 1. In Cloudera Manager, click the **YARN** link in the cluster section. @@ -147,7 +148,7 @@ Methods for accessing YARN vary depending on the default management software and --- -###Ambari +### Ambari 1. From the Ambari Dashboard, select **YARN**. @@ -160,9 +161,9 @@ Methods for accessing YARN vary depending on the default management software and --- -##For Non-Hadoop Users +## For Non-Hadoop Users -###Without Current Jobs +### Without Current Jobs If you are not using Hadoop and the job is not running: @@ -172,7 +173,7 @@ If you are not using Hadoop and the job is not running: --- -###With Current Jobs +### With Current Jobs If you are not using Hadoop and the job is still running: @@ -223,16 +224,16 @@ If you are not using Hadoop and the job is still running: - To view the REST API logs from R: - 0. In R, enter `h2o.startLogging()`. The output displays the location of the REST API logs: + 1. In R, enter `h2o.startLogging()`. The output displays the location of the REST API logs: ``` > h2o.startLogging() Appending REST API transactions to log file /var/folders/ylcq5nhky53hjcl9wrqxt39kz80000gn/T//RtmpE7X8Yv/rest.log ``` - 0. Copy the displayed file path. + 2. Copy the displayed file path. In Terminal, enter `less` and paste the file path. - 0. Press Enter. A time-stamped log of all REST API transactions displays. + 3. Press Enter. A time-stamped log of all REST API transactions displays. 
``` ------------------------------------------------------------ diff --git a/h2o-docs/src/product/howto/H2O-DevS3Creds.md b/h2o-docs/src/product/howto/H2O-DevS3Creds.md index fefc0b8a364..a273e2f10fe 100644 --- a/h2o-docs/src/product/howto/H2O-DevS3Creds.md +++ b/h2o-docs/src/product/howto/H2O-DevS3Creds.md @@ -1,5 +1,7 @@ # On EC2 and S3 +>**Note**: This topic is no longer being maintained. Refer to the individual topics in the [Cloud Integration](https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/cloud-integration) folder for the most up-to-date documentation. + ## On EC2 >Tested on Redhat AMI, Amazon Linux AMI, and Ubuntu AMI @@ -8,13 +10,13 @@ To use the Amazon Web Services (AWS) S3 storage solution, you will need to pass For security reasons, we recommend writing a script to read the access credentials that are stored in a separate file. This will not only keep your credentials from propagating to other locations, but it will also make it easier to change the credential information later. -##Standalone Instance +## Standalone Instance When running H2O in standalone mode using the simple Java launch command, we can pass in the S3 credentials in two ways. - You can pass in credentials in standalone mode by creating a `core-site.xml` file and pass it in with the flag `-hdfs_config`. For an example `core-site.xml` file, refer to [Core-site.xml](#Example). - 0. Edit the properties in the core-site.xml file to include your Access Key ID and Access Key as shown in the following example: + 1. Edit the properties in the core-site.xml file to include your Access Key ID and Access Key as shown in the following example: ``` @@ -27,11 +29,11 @@ When running H2O in standalone mode using the simple Java launch command, we can [AWS SECRET ACCESS KEY] ``` - 0. Launch with the configuration file `core-site.xml` by entering the following in the command line: + 2. 
Launch with the configuration file `core-site.xml` by entering the following in the command line: java -jar h2o.jar -hdfs_config core-site.xml - 0. Import the data using importFile with the S3 url path: + 3. Import the data using importFile with the S3 url path: `s3n://bucket/path/to/file.csv` @@ -52,7 +54,7 @@ When running H2O in standalone mode using the simple Java launch command, we can --- -##Multi-Node Instance +## Multi-Node Instance >[Python](http://www.amazon.com/Python-and-AWS-Cookbook-ebook/dp/B005ZTO0UW/ref=sr_1_1?ie=UTF8&qid=1379879111&sr=8-1&keywords=python+aws) and the [`boto`](http://boto.readthedocs.org/en/latest/) Python library are required to launch a multi-node instance of H2O on EC2. Confirm these dependencies are installed before proceeding. @@ -60,7 +62,7 @@ For more information, refer to the [H2O EC2 repo](https://github.com/h2oai/h2o-3 Build a cluster of EC2 instances by running the following commands on the host that can access the nodes using a public DNS name. -0. Edit `h2o-cluster-launch-instances.py` to include your SSH key name and security group name, as well as any other environment-specific variables. +1. Edit `h2o-cluster-launch-instances.py` to include your SSH key name and security group name, as well as any other environment-specific variables. ``` ./h2o-cluster-launch-instances.py @@ -76,17 +78,17 @@ Build a cluster of EC2 instances by running the following commands on the host t **Note**: The second method may be faster than the first, since download pulls from S3. -0. Distribute the credentials using `./h2o-cluster-distribute-aws-credentials.sh`. +2. Distribute the credentials using `./h2o-cluster-distribute-aws-credentials.sh`. >**Note**: If you are running H2O using an IAM role, it is not necessary to distribute the AWS credentials to all the nodes in the cluster. The latest version of H2O can access the temporary access key. 
>**Caution**: Distributing the AWS credentials copies the Amazon `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` to the instances to enable S3 and S3N access. Use caution when adding your security keys to the cloud. -0. Start H2O by launching one H2O node per EC2 instance: +3. Start H2O by launching one H2O node per EC2 instance: `./h2o-cluster-start-h2o.sh` Wait 60 seconds after entering the command before entering it on the next node. -0. In your internet browser, substitute any of the public DNS node addresses for `IP_ADDRESS` in the following example: +4. In your internet browser, substitute any of the public DNS node addresses for `IP_ADDRESS` in the following example: `http://IP_ADDRESS:54321` - To start H2O: `./h2o-cluster-start-h2o.sh` @@ -99,7 +101,7 @@ Build a cluster of EC2 instances by running the following commands on the host t -##Core-site.xml Example +## Core-site.xml Example The following is an example core-site.xml file: @@ -131,11 +133,11 @@ The following is an example core-site.xml file: --- -##Launching H2O +## Launching H2O **Note**: Before launching H2O on an EC2 cluster, verify that ports `54321` and `54322` are both accessible by TCP and UDP. -###Selecting the Operating System and Virtualization Type +### Selecting the Operating System and Virtualization Type Select your operating system and the virtualization type of the prebuilt AMI on Amazon. If you are using Windows, you will need to use a hardware-assisted virtual machine (HVM). If you are using Linux, you can choose between para-virtualization (PV) and HVM. These selections determine the type of instances you can launch. @@ -145,19 +147,19 @@ For more information about virtualization types, refer to [Amazon](http://docs.a --- -###Configuring the Instance +### Configuring the Instance -0. Select the IAM role and policy to use to launch the instance. H2O detects the temporary access keys associated with the instance, so you don't need to copy your AWS credentials to the instances. +1. 
Select the IAM role and policy to use to launch the instance. H2O detects the temporary access keys associated with the instance, so you don't need to copy your AWS credentials to the instances. ![EC2 Configuration](images/ec2_config.png) -0. When launching the instance, select an accessible key pair. +2. When launching the instance, select an accessible key pair. ![EC2 Key Pair](images/ec2_key_pair.png) --- -####(Windows Users) Tunneling into the Instance +#### (Windows Users) Tunneling into the Instance For Windows users that do not have the ability to use `ssh` from the terminal, either download Cygwin or a Git Bash that has the capability to run `ssh`: @@ -165,39 +167,40 @@ For Windows users that do not have the ability to use `ssh` from the terminal, e Otherwise, download PuTTY and follow these instructions: -0. Launch the PuTTY Key Generator. -0. Load your downloaded AWS pem key file. +1. Launch the PuTTY Key Generator. +2. Load your downloaded AWS pem key file. **Note:** To see the file, change the browser file type to "All". -0. Save the private key as a .ppk file. +3. Save the private key as a .ppk file. ![Private Key](images/ec2_putty_key.png) -0. Launch the PuTTY client. -0. In the *Session* section, enter the host name or IP address. For Ubuntu users, the default host name is `ubuntu@`. For Linux users, the default host name is `ec2-user@`. +4. Launch the PuTTY client. +5. In the *Session* section, enter the host name or IP address. For Ubuntu users, the default host name is `ubuntu@`. For Linux users, the default host name is `ec2-user@`. ![Configuring Session](images/ec2_putty_connect_1.png) -0. Select *SSH*, then *Auth* in the sidebar, and click the **Browse** button to select the private key file for authentication. +6. Select *SSH*, then *Auth* in the sidebar, and click the **Browse** button to select the private key file for authentication. ![Configuring SSH](images/ec2_putty_connect_2.png) -0. 
Start a new session and click the **Yes** button to confirm caching of the server's rsa2 key fingerprint and continue connecting. +7. Start a new session and click the **Yes** button to confirm caching of the server's rsa2 key fingerprint and continue connecting. ![PuTTY Alert](images/ec2_putty_alert.png) --- -###Downloading Java and H2O +### Downloading Java and H2O -0. Download [Java](http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html +1. Download [Java](http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html ) (JDK 1.7 or later) if it is not already available on the instance. -0. To download H2O, run the `wget` command with the link to the zip file available on our [website](http://h2o.ai/download/) by copying the link associated with the **Download** button for the selected H2O build. + +2. To download H2O, run the `wget` command with the link to the zip file available on our [website](http://h2o.ai/download/) by copying the link associated with the **Download** button for the selected H2O build. wget http://h2o-release.s3.amazonaws.com/h2o/{{branch_name}}/{{build_number}}/index.html unzip h2o-{{project_version}}.zip cd h2o-{{project_version}} java -Xmx4g -jar h2o.jar -0. From your browser, navigate to `:54321` or `:54321` to use H2O's web interface. +3. From your browser, navigate to `:54321` or `:54321` to use H2O's web interface. diff --git a/h2o-docs/src/product/howto/LDAP.md b/h2o-docs/src/product/howto/LDAP.md index bdc27d11d1d..156882e4db7 100644 --- a/h2o-docs/src/product/howto/LDAP.md +++ b/h2o-docs/src/product/howto/LDAP.md @@ -1,13 +1,13 @@ -#Connecting to H2O over LDAP +# Connecting to H2O over LDAP If your network uses an LDAP protocol, perform the following steps to connect to H2O: -0. Launch H2O. -0. Copy the URL that displays in the H2O output. In the following example, the # symbols represent the numbers in the IP address. +1. Launch H2O. +2. 
Copy the URL that displays in the H2O output. In the following example, the # symbols represent the numbers in the IP address. `Open H2O Flow in your web browser: https://###.##.###.##:54321` -0. Paste the URL into your browser. -0. Log in using your LDAP credentials. +3. Paste the URL into your browser. +4. Log in using your LDAP credentials. H2O is now ready to use. diff --git a/h2o-docs/src/product/howto/MOJO_QuickStart.md b/h2o-docs/src/product/howto/MOJO_QuickStart.md index a44beac0b48..6fc07ca2adb 100644 --- a/h2o-docs/src/product/howto/MOJO_QuickStart.md +++ b/h2o-docs/src/product/howto/MOJO_QuickStart.md @@ -34,56 +34,56 @@ The examples below describe how to start H2O and create a model using R and Pyth 1. Open a terminal window and start r. 2. Run the following commands to build a simple GBM model. -```R -library(h2o) -h2o.init(nthreads=-1) -path = system.file("extdata", "prostate.csv", package="h2o") -h2o_df = h2o.importFile(path) -h2o_df$RACE = as.factor(h2o_df$RACE) -model = h2o.gbm(y="CAPSULE", - x=c("AGE", "RACE", "PSA", "GLEASON"), - training_frame=h2o_df, - distribution="bernoulli", - ntrees=100, - max_depth=4, - learn_rate=0.1) -``` + ```R + library(h2o) + h2o.init(nthreads=-1) + path = system.file("extdata", "prostate.csv", package="h2o") + h2o_df = h2o.importFile(path) + h2o_df$RACE = as.factor(h2o_df$RACE) + model = h2o.gbm(y="CAPSULE", + x=c("AGE", "RACE", "PSA", "GLEASON"), + training_frame=h2o_df, + distribution="bernoulli", + ntrees=100, + max_depth=4, + learn_rate=0.1) + ``` 3. Download the MOJO and the resulting h2o-genmodel.jar file to a new **experiment** folder. 
-```R
-modelfile = model.download_mojo(path="~/experiment/", get_genmodel_jar=True)
-print("Model saved to " + modelfile)
-Model saved to /Users/user/GBM_model_R_1475248925871_74.zip"
-```
+ ```R
+ modelfile <- h2o.download_mojo(model, path="~/experiment/", get_genmodel_jar=TRUE)
+ print(paste("Model saved to", modelfile))
+ # Model saved to /Users/user/GBM_model_R_1475248925871_74.zip
+ ```

**Build and extract a model using Python**

1. Open a terminal window and start python.
2. Run the following commands to build a simple GBM model. The model, along with the **h2o-genmodel.jar** file, will then be downloaded to an **experiment** folder.

-```R
-import h2o
-from h2o.estimators.gbm import H2OGradientBoostingEstimator
-h2o.init()
-h2o_df = h2o.load_dataset("prostate.csv")
-h2o_df["CAPSULE"] = h2o_df["CAPSULE"].asfactor()
-model=H2OGradientBoostingEstimator(distribution="bernoulli",
- ntrees=100,
- max_depth=4,
- learn_rate=0.1)
-model.train(y="CAPSULE",
- x=["AGE","RACE","PSA","GLEASON"],
- training_frame=h2o_df)
-```
+ ```python
+ import h2o
+ from h2o.estimators.gbm import H2OGradientBoostingEstimator
+ h2o.init()
+ h2o_df = h2o.load_dataset("prostate.csv")
+ h2o_df["CAPSULE"] = h2o_df["CAPSULE"].asfactor()
+ model = H2OGradientBoostingEstimator(distribution="bernoulli",
+                                      ntrees=100,
+                                      max_depth=4,
+                                      learn_rate=0.1)
+ model.train(y="CAPSULE",
+             x=["AGE","RACE","PSA","GLEASON"],
+             training_frame=h2o_df)
+ ```

3. Download the MOJO and the resulting h2o-genmodel.jar file to a new **experiment** folder.
-```R -modelfile = model.download_mojo(path="~/experiment/", get_genmodel_jar=True) -print("Model saved to " + modelfile) -Model saved to /Users/user/GBM_model_python_1475248925871_888.zip -``` + ```R + modelfile = model.download_mojo(path="~/experiment/", get_genmodel_jar=True) + print("Model saved to " + modelfile) + Model saved to /Users/user/GBM_model_python_1475248925871_888.zip + ``` ### Step 2: Compile and run the MOJO @@ -93,44 +93,44 @@ Model saved to /Users/user/GBM_model_python_1475248925871_888.zip 2. Create your main program in the **experiment** folder by creating a new file called main.java (for example, using "vim main.java"). Include the following contents. Note that this file references the GBM model created above using R. -```java -import java.io.*; -import hex.genmodel.easy.RowData; -import hex.genmodel.easy.EasyPredictModelWrapper; -import hex.genmodel.easy.prediction.*; -import hex.genmodel.MojoModel; - -public class main { - public static void main(String[] args) throws Exception { - EasyPredictModelWrapper model = new EasyPredictModelWrapper(MojoModel.load("GBM_model_R_1475248925871_74.zip")); - - RowData row = new RowData(); - row.put("AGE", "68"); - row.put("RACE", "2"); - row.put("DCAPS", "2"); - row.put("VOL", "0"); - row.put("GLEASON", "6"); - - BinomialModelPrediction p = model.predictBinomial(row); - System.out.println("Has penetrated the prostatic capsule (1=yes; 0=no): " + p.label); - System.out.print("Class probabilities: "); - for (int i = 0; i < p.classProbabilities.length; i++) { - if (i > 0) { - System.out.print(","); - } - System.out.print(p.classProbabilities[i]); - } - System.out.println(""); - } -} -``` + ```java + import java.io.*; + import hex.genmodel.easy.RowData; + import hex.genmodel.easy.EasyPredictModelWrapper; + import hex.genmodel.easy.prediction.*; + import hex.genmodel.MojoModel; + + public class main { + public static void main(String[] args) throws Exception { + EasyPredictModelWrapper model = new 
EasyPredictModelWrapper(MojoModel.load("GBM_model_R_1475248925871_74.zip")); + + RowData row = new RowData(); + row.put("AGE", "68"); + row.put("RACE", "2"); + row.put("DCAPS", "2"); + row.put("VOL", "0"); + row.put("GLEASON", "6"); + + BinomialModelPrediction p = model.predictBinomial(row); + System.out.println("Has penetrated the prostatic capsule (1=yes; 0=no): " + p.label); + System.out.print("Class probabilities: "); + for (int i = 0; i < p.classProbabilities.length; i++) { + if (i > 0) { + System.out.print(","); + } + System.out.print(p.classProbabilities[i]); + } + System.out.println(""); + } + } + ``` 3. Compile and run in terminal window 2. -```bash -$ javac -cp h2o-genmodel.jar -J-Xms2g -J-XX:MaxPermSize=128m main.java -$ java -cp .:h2o-genmodel.jar main -``` + ```bash + $ javac -cp h2o-genmodel.jar -J-Xms2g -J-XX:MaxPermSize=128m main.java + $ java -cp .:h2o-genmodel.jar main + ``` The following output displays: diff --git a/h2o-docs/src/product/howto/Videos.md b/h2o-docs/src/product/howto/Videos.md index 1d50671a774..5d3e76c104f 100644 --- a/h2o-docs/src/product/howto/Videos.md +++ b/h2o-docs/src/product/howto/Videos.md @@ -1,31 +1,31 @@ -#Quick Start Videos +# Quick Start Videos -##H2O Quick Start with Flow +## H2O Quick Start with Flow --- -##H2O Quick Start with Python +## H2O Quick Start with Python --- -##H2O Quick Start on Hadoop +## H2O Quick Start on Hadoop --- -##H2O Quick Start with Sparkling Water +## H2O Quick Start with Sparkling Water --- -##H2O Quick Start with R +## H2O Quick Start with R diff --git a/h2o-docs/src/product/howto/YARN_BP.md b/h2o-docs/src/product/howto/YARN_BP.md index 0d02d9cf98b..50de97f34c0 100644 --- a/h2o-docs/src/product/howto/YARN_BP.md +++ b/h2o-docs/src/product/howto/YARN_BP.md @@ -1,9 +1,11 @@ -#YARN Best Practices +# YARN Best Practices + +>**Note**: This topic is no longer being maintained. 
Refer to [YARN Best Practices](https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/welcome.rst#yarn-best-practices) for the most up-to-date documentation.

YARN (Yet Another Resource Negotiator) is a resource management framework. H2O can be launched as an application on YARN. If you run H2O on Hadoop, you are essentially running H2O on YARN. If you are not currently using YARN to manage your cluster resources, we strongly recommend it.

-##Using H2O with YARN
+## Using H2O with YARN

When you launch H2O on Hadoop using the `hadoop jar` command, YARN allocates the necessary resources to launch the requested number of nodes. H2O launches as a MapReduce (V2) task, where each mapper is an H2O node of the specified size.

@@ -19,7 +21,7 @@ To resolve configuration issues, adjust the maximum memory that YARN will allow

The `mapreduce.map.memory.mb` value must be less than the YARN memory configuration values for the launch to succeed.

-##Configuring YARN
+## Configuring YARN

**For Cloudera, configure the settings in Cloudera Manager. Depending on how the cluster is configured, you may need to change the settings for more than one role group.**

@@ -63,7 +65,7 @@ To verify the values were changed, check the values for the following properties

- yarn.scheduler.maximum-allocation-mb

-##Limiting CPU Usage
+## Limiting CPU Usage

To limit the number of CPUs used by H2O, use the `-nthreads` option and specify the maximum number of CPUs for a single container to use. The following example limits the number of CPUs to four:

@@ -72,7 +74,7 @@ To limit the number of CPUs used by H2O, use the `-nthreads` option and specify

**Note**: The default is 4 times the number of CPUs. You must specify at least four CPUs; otherwise, the following error message displays: `ERROR: nthreads invalid (must be >= 4)`

-##Specifying Queues
+## Specifying Queues

If you do not specify a queue when launching H2O, H2O jobs are submitted to the default queue.
Jobs submitted to the default queue have a lower priority than jobs submitted to a specific queue.

@@ -86,16 +88,16 @@ For example,

-##Specifying Output Directories
+## Specifying Output Directories

To prevent overwriting multiple users' files, each job must have a unique output directory name. Change the `-output hdfsOutputDir` argument (where `hdfsOutputDir` is the name of the directory).

Alternatively, you can delete the directory (manually or by using a script) instead of creating a unique directory each time you launch H2O.

-##Customizing YARN
+## Customizing YARN

Most of the configurable YARN variables are stored in `yarn-site.xml`. To prevent settings from being overridden, you can mark a config as "final." If you change any values in `yarn-site.xml`, you must restart YARN to confirm the changes.

-##Accessing Logs
+## Accessing Logs

To learn how to access logs in YARN, refer to [Downloading Logs](http://h2o-release.s3.amazonaws.com/h2o/{{branch_name}}/{{build_number}}/docs-website/h2o-docs/index.html#Downloading%20Logs).

diff --git a/h2o-docs/src/product/tutorials/GainsLift.md b/h2o-docs/src/product/tutorials/GainsLift.md
index 11d17ba96ee..b7a79ce5a71 100644
--- a/h2o-docs/src/product/tutorials/GainsLift.md
+++ b/h2o-docs/src/product/tutorials/GainsLift.md
@@ -1,4 +1,6 @@
-#Gains/Lift Table
+# Gains/Lift Table
+
+>**Note**: This topic is no longer being maintained. Refer to [Interpreting the Gains/Lift Chart](https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/flow.rst#interpreting-the-gains-lift-chart) for the most up-to-date documentation.

The Gains/Lift table evaluates the prediction ability of a binary classification model. The accuracy of the classification model for a random sample is evaluated according to the results when the model is and is not used.
@@ -22,21 +24,21 @@ The Gains/Lift table also reports for each group the threshold probability value

During the Gains/Lift calculations, all rows containing missing values (NAs) in either the label (response) or the prediction probability are ignored.

-##Requirements:
+## Requirements:

- The training frame dataset must contain actual binary class labels.
- The prediction column used as the response must contain probabilities.
- For GLM, the visualization displays only when using `nfolds` (for example, `nfolds=2`).
- The model type cannot be K-means or PCA.

-##Creating a Gains/Lift table
+## Creating a Gains/Lift table

-0. Import a binary classification dataset.
-0. Select the model type (DL, DRF, GBM, GLM, or Naive Bayes)
-0. Select the imported dataset from the drop-down *training_frame* list.
-0. Select a binomial column from the drop-down *response_column* list.
-0. Click the **Build Model** button, then click the **View** button after the model is complete.
-0. Scroll down to view the Gains/Lift chart (as shown in the example screenshot below).
+1. Import a binary classification dataset.
+2. Select the model type (DL, DRF, GBM, GLM, or Naïve Bayes).
+3. Select the imported dataset from the drop-down *training_frame* list.
+4. Select a binomial column from the drop-down *response_column* list.
+5. Click the **Build Model** button, then click the **View** button after the model is complete.
+6. Scroll down to view the Gains/Lift chart (as shown in the example screenshot below).

![Gains/Lift chart](images/GainsLift.png)

diff --git a/h2o-docs/src/product/tutorials/GridSearch.md b/h2o-docs/src/product/tutorials/GridSearch.md
index dcc5960e77f..fb05fdffd0b 100644
--- a/h2o-docs/src/product/tutorials/GridSearch.md
+++ b/h2o-docs/src/product/tutorials/GridSearch.md
@@ -1,11 +1,14 @@
# Grid Search (Hyperparameter Search) API
+
+>**Note**: This topic is no longer being maintained.
Refer to [Grid Search](https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/grid-search.rst) for the most up-to-date documentation.
+
## REST

The current implementation of the grid search REST API exposes the following endpoints:

- `GET //Grids`: List available grids, with optional parameters to sort the list by model metric such as MSE
- `GET //Grids/`: Return specified grid
-- `POST //Grid/`: Start a new grid search
+- `POST //Grids/`: Start a new grid search
- ``: Supported algorithm values are `{glm, gbm, drf, kmeans, deeplearning}`

Endpoints accept model-specific parameters (e.g., [GBMParametersV3](https://github.com/h2oai/h2o-3/blob/master/h2o-algos/src/main/java/hex/schemas/GBMV3.java)) and an additional parameter called `hyper_parameters`, which contains a dictionary of the hyperparameters that will be searched. In this dictionary, an array of values is specified for each searched hyperparameter.

@@ -204,7 +207,7 @@ The following hyperparameters are supported by grid search.
- `k`
- `max_iterations`

-##Example
+## Example

Invoke a new GBM model grid search by POSTing the following request to `/99/Grid/gbm`:

@@ -212,7 +215,7 @@

parms:{hyper_parameters={"ntrees":[1,5],"learn_rate":[0.1,0.01]}, training_frame="filefd41fe7ac0b_csv_1.hex_2", grid_id="gbm_grid_search", response_column="Species", ignored_columns=[""]}
```

-##Grid Search in R
+## Grid Search in R

Grid search in R provides the following capabilities:

@@ -224,7 +227,7 @@

- `hyper_parameters` attribute for passing a list of hyper parameters (e.g., `list(ntrees=c(1,100), learn_rate=c(0.1,0.001))`)
- `search_criteria` optional attribute for specifying a more advanced search strategy

-###Example
+### Example

```r
ntrees_opts = c(1, 5)
@@ -336,7 +339,7 @@ print(ntrees)

For more information, refer to the [R grid search code](https://github.com/h2oai/h2o-3/blob/master/h2o-r/h2o-package/R/grid.R) and [runit_GBMGrid_airlines.R](https://github.com/h2oai/h2o-3/blob/master/h2o-r/tests/testdir_algos/gbm/runit_GBMGrid_airlines.R).

-##Grid Search in Python
+## Grid Search in Python

- Class is `H2OGridSearch`
- `.show()`: Display a list of models (including model IDs, hyperparameters, and MSE) explored by grid search (where `` is an instance of an `H2OGridSearch` class)
@@ -347,7 +350,7 @@

-###Example
+### Example

```python
@@ -361,7 +364,7 @@

For more information, refer to the [Python grid search code](https://github.com/h2oai/h2o-3/blob/master/h2o-py/h2o/grid/grid_search.py) and [pyunit_benign_glm_grid.py](https://github.com/h2oai/h2o-3/blob/master/h2o-py/tests/testdir_algos/glm/pyunit_benign_glm_grid.py).
-##Grid Search Java API +## Grid Search Java API Each parameter exposed by the schema can specify if it is supported by grid search by specifying the attribute `gridable=true` in the schema @API annotation. In any case, the Java API does not restrict the parameters supported by grid search. @@ -383,7 +386,7 @@ The Java API can grid search any parameters defined in the model parameter's cla Additional methods are available in the model builder to support creation of model parameters and configuration. This eliminates the requirement of the previous implementation where each gridable value was represented as a `double`. This also allows users to specify different building strategies for model parameters. For example, the REST layer uses a builder that validates parameters against the model parameter's schema, whereas the Java API uses a simple reflective builder. Additional reflection support is provided by PojoUtils (methods `setField`, `getFieldValue`). -###Example +### Example ```java HashMap<String, Object[]> hyperParms = new HashMap<>(); @@ -412,7 +415,7 @@ In the following example, the PCA algorithm has been implemented and we would li To add support for PCA grid search: -0. Add the PCA model build factory into the `hex.grid.ModelFactories` class: +1. Add the PCA model build factory into the `hex.grid.ModelFactories` class: ```java class ModelFactories { @@ -433,7 +436,7 @@ To add support for PCA grid search: } ``` -0. Add the PCA REST end-point schema: +2. Add the PCA REST end-point schema: ```java public class PCAGridSearchV99 extends GridSearchSchema` is not supported unless you define the class `GBMGrid extends Grid`. - Grid Job scheduler is sequential only; schedulers for concurrent builds are under development. Note that in cases of true big data, sequential scheduling will yield the highest performance. It is only with a large cluster and small data that concurrent scheduling will improve performance.
@@ -510,7 +513,7 @@ There are tests for the `RandomDiscrete` search criteria in [runit_GBMGrid_airli - There is no way to list the hyper space parameters that caused a model builder job failure. -##Documentation +## Documentation - H2O Core Java Developer Documentation: The definitive Java API guide for the core components of H2O. diff --git a/h2o-docs/src/product/tutorials/Interactions.md b/h2o-docs/src/product/tutorials/Interactions.md index bca27e0ca57..0623349ebf1 100644 --- a/h2o-docs/src/product/tutorials/Interactions.md +++ b/h2o-docs/src/product/tutorials/Interactions.md @@ -1,4 +1,4 @@ -#Interaction Features between Factors +# Interaction Features between Factors Use `h2o.interaction` to create interaction terms between categorical columns of an H2O frame. This feature creates N-th order interaction terms between categorical features of an H2O Frame (N=0,1,2,3,...). @@ -17,7 +17,7 @@ Use `h2o.interaction` to create interaction terms between categorical columns of -##Example +## Example ``` library(h2o) diff --git a/h2o-docs/src/product/tutorials/datascience/DataScienceH2O-Dev.md b/h2o-docs/src/product/tutorials/datascience/DataScienceH2O-Dev.md index f9a6d3a2f4a..ab727d6d219 100644 --- a/h2o-docs/src/product/tutorials/datascience/DataScienceH2O-Dev.md +++ b/h2o-docs/src/product/tutorials/datascience/DataScienceH2O-Dev.md @@ -1,23 +1,26 @@ # Data Science Algorithms +>**Note**: This topic is no longer being maintained. Refer to the topics in the [Data Science](https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/data-science) folder for the most up-to-date documentation. + + This document describes how to define the models and how to interpret the model, as well as the algorithm itself, and provides an FAQ. -##Commonalities +## Commonalities -###Quantiles +### Quantiles **Note**: The quantile results in Flow are computed lazily on-demand and cached. It is a fast approximation ((max - min) / 1024) that is very accurate for most use cases.
If the distribution is skewed, the quantile results may not be as accurate as the results obtained using `h2o.quantile` in R or `H2OFrame.quantile` in Python. -##K-Means +## K-Means -###Introduction +### Introduction K-Means falls in the general category of clustering algorithms. -###Defining a K-Means Model +### Defining a K-Means Model - **model_id**: (Optional) Enter a custom name for the model to use as a reference. By default, H2O automatically generates a destination key. @@ -53,7 +56,7 @@ K-Means falls in the general category of clustering algorithms. - **seed**: Specify the random number generator (RNG) seed for algorithm components dependent on randomization. The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternative configurations. -###Interpreting a K-Means Model +### Interpreting a K-Means Model By default, the following output displays: @@ -68,7 +71,7 @@ By default, the following output displays: K-Means randomly chooses starting points and converges to a local minimum of centroids. The number of clusters is arbitrary, and should be thought of as a tuning parameter. The output is a matrix of the cluster assignments and the coordinates of the cluster centers in terms of the originally chosen attributes. Your cluster centers may differ slightly from run to run as this problem is Non-deterministic Polynomial-time (NP)-hard. -###FAQ +### FAQ - **How does the algorithm handle missing values during training?** @@ -96,7 +99,7 @@ The output is a matrix of the cluster assignments and the coordinates of the clu -###K-Means Algorithm +### K-Means Algorithm The number of clusters \(K\) is user-defined and is determined a priori. @@ -146,7 +149,7 @@ The number of clusters \(K\) is user-defined and is determined a priori. -###References +### References [Hastie, Trevor, Robert Tibshirani, and J Jerome H Friedman. The Elements of Statistical Learning. Vol.1. 
N.p., Springer New York, 2001.](http://www.stanford.edu/~hastie/local.ftp/Springer/OLD//ESLII_print4.pdf) @@ -155,9 +158,9 @@ Xiong, Hui, Junjie Wu, and Jian Chen. “K-means Clustering Versus Validation Me --- -##GLM +## GLM -###Introduction +### Introduction Generalized Linear Models (GLM) estimate regression models for outcomes following exponential distributions. In addition to the Gaussian (i.e. normal) distribution, these include Poisson, binomial, and gamma distributions. Each serves a different purpose, and depending on distribution and link function choice, can be used either for prediction or classification. @@ -170,7 +173,7 @@ The GLM suite includes: - Gamma regression -###Defining a GLM Model +### Defining a GLM Model - **model_id**: (Optional) Enter a custom name for the model to use as a reference. By default, H2O automatically generates a destination key. @@ -268,7 +271,7 @@ The GLM suite includes: - **seed**: Specify the random number generator (RNG) seed for algorithm components dependent on randomization. The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternative configurations. -###Interpreting a GLM Model +### Interpreting a GLM Model By default, the following output displays: @@ -311,7 +314,7 @@ To make a custom GLM model from R or Python: - Python: `H2OGeneralizedLinearEstimator.makeGLMModel` (static method) takes a model, a dictionary containing coefficients, and an (optional) decision threshold as parameters. -###FAQ +### FAQ - **How does the algorithm handle missing values during training?** @@ -369,7 +372,7 @@ For GLM, the variable importance represents the coefficient magnitudes. -###GLM Algorithm +### GLM Algorithm Following the definitive text by P. McCullagh and J.A. Nelder (1989) on the generalization of linear models to non-linear distributions of the response variable Y, H2O fits GLM models based on maximum likelihood estimation via iteratively reweighted least squares.
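The iteratively reweighted least squares (IRLS) idea can be sketched for the simplest possible case - a one-coefficient logistic regression on toy, hypothetical data (H2O's implementation handles many coefficients, regularization, and distributed data, none of which appear here):

```python
import math

# Toy data: one predictor x, binary response y (deliberately not separable,
# so the maximum-likelihood coefficient is finite).
x = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
y = [0, 0, 1, 0, 1, 1]

beta = 0.0
for _ in range(50):
    p = [1.0 / (1.0 + math.exp(-beta * xi)) for xi in x]           # current fit
    grad = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))      # X'(y - p)
    hess = sum(xi * xi * pi * (1.0 - pi) for xi, pi in zip(x, p))  # X'WX, W = p(1 - p)
    step = grad / hess
    beta += step                                                    # Newton/IRLS update
    if abs(step) < 1e-10:
        break
```

Each pass re-solves a weighted least-squares problem with weights `p(1 - p)` from the previous fit, which is exactly the "reweighting" the text refers to.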
@@ -442,7 +445,7 @@ Relative to P, the larger that (N/CPUs) becomes, the more trivial p becomes to t For more information about how GLM works, refer to the [Generalized Linear Modeling booklet](http://h2o.ai/resources). -###References +### References Breslow, N E. “Generalized Linear Models: Checking Assumptions and Strengthening Conclusions.” Statistica Applicata 8 (1996): 23-41. @@ -466,9 +469,9 @@ Snee, Ronald D. “Validation of Regression Models: Methods and Examples.” Tec -##DRF +## DRF -###Introduction +### Introduction Distributed Random Forest (DRF) is a powerful classification and regression tool. When given a set of data, DRF generates a forest of classification (or regression) trees, rather than a single classification (or regression) tree. Each of these trees is a weak learner built on a subset of rows and columns. More trees will reduce the variance. Both classification and regression take the average prediction over all of their trees to make a final prediction, whether predicting for a class or numeric value. (Note: for a categorical response column, DRF maps factors (e.g. 'dog', 'cat', 'mouse') in lexicographic order to a name lookup array with integer indices (e.g. 'cat' -> 0, 'dog' -> 1, 'mouse' -> 2).) @@ -487,7 +490,7 @@ There was some code cleanup and refactoring to support the following features: DRF no longer has a special-cased histogram for classification or regression (class DBinomHistogram has been superseded by DRealHistogram) since it was not applicable to cases with observation weights or for cross-validation. -###Defining a DRF Model +### Defining a DRF Model - **model_id**: (Optional) Enter a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.
@@ -608,7 +611,7 @@ DRF no longer has a special-cased histogram for classification or regression (cl - **nbins\_top\_level**: (For numerical/real/int columns only) Specify the minimum number of bins at the root level to use to build the histogram. This number will then be decreased by a factor of two per level. -###Interpreting a DRF Model +### Interpreting a DRF Model By default, the following output displays: @@ -625,10 +628,10 @@ By default, the following output displays: - Variable importances in tabular format -###Leaf Node Assignment +### Leaf Node Assignment Trees cluster observations into leaf nodes, and this information can be useful for feature engineering or model interpretability. Use **h2o.predict\_leaf\_node\_assignment\(model, frame\)** to get an H2OFrame with the leaf node assignments, or click the checkbox when making predictions from Flow. Those leaf nodes represent decision rules that can be fed to other models (i.e., GLM with lambda search and strong rules) to obtain a limited set of the most important rules. -###FAQ +### FAQ - **How does the algorithm handle missing values during training?** @@ -681,25 +684,25 @@ For regression, the floor - in this example, (100/3)=33 columns - is used for ea `mtries` is configured independently of `col_sample_rate_per_tree`, but it can be limited by it. For example, if `col_sample_rate_per_tree=0.01`, then there's only one column left for each split, regardless of how large the value for `mtries` is. -###DRF Algorithm +### DRF Algorithm -###References +### References P. Geurts, D. Ernst., and L. Wehenkel, “Extremely randomized trees”, Machine Learning, 63(1), 3-42, 2006. --- -##Naïve Bayes +## Naïve Bayes -###Introduction +### Introduction Naïve Bayes (NB) is a classification algorithm that relies on strong assumptions of the independence of covariates in applying Bayes Theorem. NB models are commonly used as an alternative to decision trees for classification problems. 
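The core Naïve Bayes computation - Bayes' theorem under the conditional-independence assumption - can be sketched in plain Python with hypothetical counts (a single categorical predictor keeps the product over features trivial; these are not the numbers from the survival table discussed below):

```python
from collections import Counter

# Hypothetical (sex, survived) observations.
rows = [("male", 0), ("male", 0), ("male", 1), ("female", 1),
        ("female", 1), ("female", 0), ("male", 0), ("female", 1)]

classes = [0, 1]
n = len(rows)
class_counts = Counter(y for _, y in rows)                           # a-priori counts
cond_counts = {c: Counter(x for x, y in rows if y == c) for c in classes}

def posterior(x):
    # P(c | x) is proportional to P(c) * P(x | c); normalize over classes.
    scores = {c: (class_counts[c] / n) * (cond_counts[c][x] / class_counts[c])
              for c in classes}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

p_female = posterior("female")
```

With several predictors, the score becomes the prior times the product of each predictor's conditional probability, which is where the independence assumption does the work.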
-###Defining a Naïve Bayes Model +### Defining a Naïve Bayes Model - **model_id**: (Optional) Enter a custom name for the model to use as a reference. By default, H2O automatically generates a destination key. @@ -736,7 +739,7 @@ Naïve Bayes (NB) is a classification algorithm that relies on strong assumption -###Interpreting a Naïve Bayes Model +### Interpreting a Naïve Bayes Model The output from Naïve Bayes is a list of tables containing the a-priori and conditional probabilities of each class of the response. The a-priori probability is the estimated probability of a particular class before observing any of the predictors. Each conditional probability table corresponds to a predictor column. The row headers are the classes of the response and the column headers are the classes of the predictor. Thus, in the table below, the probability of survival (y) given a person is male (x) is 0.91543624. @@ -756,7 +759,7 @@ By default, the following output displays: - Y-Levels (levels of the response column) - P-conditionals -###FAQ +### FAQ - **How does the algorithm handle missing values during training?** @@ -800,7 +803,7 @@ By default, the following output displays: For Naïve Bayes, we recommend using many smaller nodes because the distributed task doesn't require intensive computation. -###Naïve Bayes Algorithm +### Naïve Bayes Algorithm The algorithm is presented for the simplified binomial case without loss of generality. @@ -855,7 +858,7 @@ Note that in the general case where y takes on k values, there are k+1 modified Laplace smoothing should be used with care; it is generally intended to allow for predictions in rare events. As prediction data becomes increasingly distinct from training data, train new models when possible to account for a broader set of possible X values. -###References +### References [Hastie, Trevor, Robert Tibshirani, and J Jerome H Friedman. The Elements of Statistical Learning. Vol.1. 
N.p., Springer New York, 2001.](http://www.stanford.edu/~hastie/local.ftp/Springer/OLD//ESLII_print4.pdf) @@ -865,15 +868,15 @@ Laplace smoothing should be used with care; it is generally intended to allow fo --- -##PCA +## PCA -###Introduction +### Introduction Principal Components Analysis (PCA) is closely related to Principal Components Regression. The algorithm is carried out on a set of possibly collinear features and performs a transformation to produce a new set of uncorrelated features. PCA is commonly used to build models without regularization or to perform dimensionality reduction. It can also be useful to carry out as a preprocessing step before distance-based algorithms such as K-Means since PCA guarantees that all dimensions of a manifold are orthogonal. -###Defining a PCA Model +### Defining a PCA Model - **model_id**: (Optional) Enter a custom name for the model to use as a reference. By default, H2O automatically generates a destination key. @@ -911,7 +914,7 @@ PCA is commonly used to model without regularization or perform dimensionality r -###Interpreting a PCA Model +### Interpreting a PCA Model PCA output returns a table displaying the number of components specified by the value for `k`. @@ -928,7 +931,7 @@ The output for PCA includes the following: -###FAQ +### FAQ - **How does the algorithm handle missing values during scoring?** @@ -980,7 +983,7 @@ For PCA, this is dependent on the selected `pca_method` parameter: After the PCA model has been built using `h2o.prcomp`, use `h2o.predict` on the original data frame and the PCA model to produce the dimensionality-reduced representation. Use `cbind` to add the predictor column from the original data frame to the data frame produced by the output of `h2o.predict`. At this point, you can build supervised learning models on the new data frame.
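A minimal sketch of that dimensionality reduction, assuming plain power iteration on the covariance matrix (toy two-feature data; this is an illustration, not H2O's actual implementation, which is selected by `pca_method`):

```python
# Toy 2-feature dataset; reduce to 1 dimension along the leading
# principal component, found by power iteration on the covariance matrix.
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]

n = len(data)
mx = sum(a for a, _ in data) / n
my = sum(b for _, b in data) / n
centered = [(a - mx, b - my) for a, b in data]

# 2x2 sample covariance matrix.
cxx = sum(a * a for a, _ in centered) / (n - 1)
cyy = sum(b * b for _, b in centered) / (n - 1)
cxy = sum(a * b for a, b in centered) / (n - 1)

# Power iteration: repeatedly apply C to a vector and renormalize.
vx, vy = 1.0, 0.0
for _ in range(200):
    wx, wy = cxx * vx + cxy * vy, cxy * vx + cyy * vy
    norm = (wx * wx + wy * wy) ** 0.5
    vx, vy = wx / norm, wy / norm

# Rayleigh quotient gives the leading eigenvalue.
eigenvalue = vx * (cxx * vx + cxy * vy) + vy * (cxy * vx + cyy * vy)

# The 1-D representation: projection of each centered row onto the component.
scores = [a * vx + b * vy for a, b in centered]
```

The `scores` column here plays the role of the reduced frame that `h2o.predict` returns for a fitted PCA model.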
-###PCA Algorithm +### PCA Algorithm Let \(X\) be an \(M\times N\) matrix where @@ -1022,7 +1025,7 @@ For each eigenvalue \(\lambda\), \((C - \lambda I)x = 0\), where \(x\) is the eigenvector. Solve for \(x\) by Gaussian elimination. -####Recovering SVD from GLRM +#### Recovering SVD from GLRM GLRM gives \(x\) and \(y\), where \(x \in \rm \Bbb I \!\Bbb R ^{n * k}\) and \( y \in \rm \Bbb I \!\Bbb R ^{k*m} \) @@ -1079,7 +1082,7 @@ Left singular vectors: \( (QU) \in \rm \Bbb I \!\Bbb R^{n * k}\) -###References +### References Gockenbach, Mark S. "Finite-Dimensional Linear Algebra (Discrete Mathematics and Its Applications)." (2010): 566-567. @@ -1087,9 +1090,9 @@ Gockenbach, Mark S. "Finite-Dimensional Linear Algebra (Discrete Mathematics and --- -##GBM +## GBM -###Introduction +### Introduction Gradient Boosted Regression and Gradient Boosted Classification are forward learning ensemble methods. The guiding heuristic is that good predictive results can be obtained through increasingly refined approximations. H2O's GBM sequentially builds regression trees on all the features of the dataset in a fully distributed way - each tree is built in parallel. @@ -1105,7 +1108,7 @@ There was some code cleanup and refactoring to support the following features: - N-fold cross-validation - Support for more distribution functions (such as Gamma, Poisson, and Tweedie) -###Defining a GBM Model +### Defining a GBM Model - **model_id**: (Optional) Enter a custom name for the model to use as a reference. By default, H2O automatically generates a destination key. @@ -1240,7 +1243,7 @@ There was some code cleanup and refactoring to support the following features: - **nbins\_top\_level**: (For numerical/real/int columns only) Specify the minimum number of bins at the root level to use to build the histogram. This number will then be decreased by a factor of two per level.
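The `nbins_top_level` halving schedule can be sketched in plain Python, under the assumption (not stated explicitly above) that the per-level bin count never drops below `nbins`; the values here are toy settings, not defaults:

```python
nbins_top_level = 1024   # bins available at the root (depth 0)
nbins = 20               # assumed floor for deeper levels in this sketch

# Bin budget per tree level: halve the root-level count at each level,
# flooring at nbins.
bins_per_level = [max(nbins, nbins_top_level // (2 ** depth))
                  for depth in range(8)]

print(bins_per_level)
```

So the root gets the finest histogram, and deeper splits work with progressively coarser ones.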
-###Interpreting a GBM Model +### Interpreting a GBM Model The output for GBM includes the following: @@ -1253,10 +1256,10 @@ The output for GBM includes the following: - Training metrics (model name, model checksum name, frame name, description, model category, duration in ms, scoring time, predictions, MSE, R2) - Variable importances in tabular format -###Leaf Node Assignment +### Leaf Node Assignment Trees cluster observations into leaf nodes, and this information can be useful for feature engineering or model interpretability. Use **h2o.predict\_leaf\_node\_assignment\(model, frame\)** to get an H2OFrame with the leaf node assignments, or click the checkbox when making predictions from Flow. Those leaf nodes represent decision rules that can be fed to other models (i.e., GLM with lambda search and strong rules) to obtain a limited set of the most important rules. -###FAQ +### FAQ - **How does the algorithm handle missing values during training?** @@ -1324,7 +1327,7 @@ Trees cluster observations into leaf nodes, and this information can be useful f You can find tutorials for using GBM with R, Python, and Flow at the following location: https://github.com/h2oai/h2o-3/tree/master/h2o-docs/src/product/tutorials/gbm -###GBM Algorithm +### GBM Algorithm H2O's Gradient Boosting Algorithms follow the algorithm specified by Hastie et al (2001): @@ -1357,7 +1360,7 @@ The above vec has a real-valued type if passed as a whole, but if the zero-weigh For more information about the GBM algorithm, refer to the [Gradient Boosted Machines booklet](http://h2o.ai/resources). -###Binning In GBM +### Binning In GBM **Is the binning range-based or percentile-based?** @@ -1381,7 +1384,7 @@ And so on: important dense ranges are split essentially logarithmically at each You can try adding a new predictor column which is either pre-binned (e.g., as a categorical with "small", "medium", and "giant" values) or log-transformed, while also keeping the old column.
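The range-based (rather than percentile-based) behavior can be sketched in plain Python: candidate split points sit at equal-width bin boundaries, so a single outlier can leave the dense region with almost no candidates until deeper levels re-bin the narrower range (hypothetical toy column, not H2O's internal histogram code):

```python
def equal_width_split_points(values, nbins=20):
    """Range-based candidate split points at equal-width bin boundaries."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins
    return [lo + i * width for i in range(1, nbins)]

# Skewed toy column: most values are tiny, one is huge.
column = [0.1, 0.2, 0.3, 0.4, 0.5, 100.0]
splits = equal_width_split_points(column)

# Nearly all candidates land in the empty gap above 0.5.
in_dense_region = [s for s in splits if s <= 0.5]
```

Here every candidate split exceeds 0.5, which is exactly the situation the per-level re-binning (and the pre-binning/log-transform advice above) is meant to address.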
-###References +### References Dietterich, Thomas G, and Eun Bae Kong. "Machine Learning Bias, Statistical Bias, and Statistical Variance of Decision Tree @@ -1409,13 +1412,13 @@ Vol.1. N.p., page 339: Springer New York, 2001.](http://www.stanford.edu/~hastie --- -##Deep Learning +## Deep Learning -###Introduction +### Introduction H2O’s Deep Learning is based on a multi-layer feed-forward artificial neural network that is trained with stochastic gradient descent using back-propagation. The network can contain a large number of hidden layers consisting of neurons with tanh, rectifier and maxout activation functions. Advanced features such as adaptive learning rate, rate annealing, momentum training, dropout, L1 or L2 regularization, checkpointing and grid search enable high predictive accuracy. Each compute node trains a copy of the global model parameters on its local data with multi-threading (asynchronously), and contributes periodically to the global model via model averaging across the network. -###Defining a Deep Learning Model +### Defining a Deep Learning Model H2O Deep Learning models have many input parameters, many of which are only accessible via the expert mode. For most cases, use the default values. Please read the following instructions before building extensive Deep Learning models. The application of grid search and successive continuation of winning models via checkpoint restart is highly recommended, as model performance can vary greatly. @@ -1595,7 +1598,7 @@ H2O Deep Learning models have many input parameters, many of which are only acce -###Interpreting a Deep Learning Model +### Interpreting a Deep Learning Model To view the results, click the View button. The output for the Deep Learning model includes the following information for both the training and testing sets: @@ -1611,7 +1614,7 @@ To view the results, click the View button. 
The output for the Deep Learning mod -###FAQ +### FAQ - **How does the algorithm handle missing values during training?** @@ -1734,7 +1737,7 @@ To view the results, click the View button. The output for the Deep Learning mod --- -###Deep Learning Algorithm +### Deep Learning Algorithm To compute deviance for a Deep Learning regression model, the following formula is used: @@ -1743,7 +1746,7 @@ For Absolute/Laplace or Huber -> MSE != Deviance For more information about how the Deep Learning algorithm works, refer to the [Deep Learning booklet](http://h2o.ai/resources). -###References +### References ["Deep Learning." *Wikipedia: The free encyclopedia*. Wikimedia Foundation, Inc. 1 May 2015. Web. 4 May 2015.](http://en.wikipedia.org/wiki/Deep_learning) @@ -1813,7 +1816,7 @@ print(cvAUCs) mean(cvAUCs) ``` -#Using Cross-Validated Predictions +## Using Cross-Validated Predictions With cross-validated model building, H2O builds N+1 models: N cross-validated models and 1 overarching model over all of the training data. @@ -1872,7 +1875,7 @@ and each one has the following shape (for example the first one): The training rows receive a prediction of `0` (more on this below) as well as `0` for all class probabilities. Each of these holdout predictions has the same number of rows as the input frame. -##Combining holdout predictions +## Combining holdout predictions The frame of cross-validated predictions is simply the superposition of the individual predictions. [Here's an example from R](https://0xdata.atlassian.net/browse/PUBDEV-2236): diff --git a/h2o-docs/src/product/tutorials/dl/dl.md b/h2o-docs/src/product/tutorials/dl/dl.md index 3a3201f9294..ca57c0552af 100644 --- a/h2o-docs/src/product/tutorials/dl/dl.md +++ b/h2o-docs/src/product/tutorials/dl/dl.md @@ -1,4 +1,4 @@ -#Deep Learning Tutorial +# Deep Learning Tutorial The purpose of this tutorial is to walk new users through Deep Learning using H2O Flow.
@@ -6,7 +6,7 @@ Those who have never used H2O before should refer to Definitive Performance Tuning Guide for Deep Learning. -###Using Deep Learning +### Using Deep Learning H2O’s Deep Learning functionalities include: @@ -34,67 +34,67 @@ If you don't have any data of your own to work with, you can find some example d -####Importing Data +#### Importing Data Before creating a model, import the data into H2O: -0. Click the **Assist Me!** button (the last button in the row of buttons below the menus). +1. Click the **Assist Me!** button (the last button in the row of buttons below the menus). ![Assist Me button](../images/Flow_AssistMeButton.png) -0. Click the **importFiles** link and enter the file path to the training dataset in the **Search** entry field. For this example, the following datasets are used: +2. Click the **importFiles** link and enter the file path to the training dataset in the **Search** entry field. For this example, the following datasets are used: - *Training*: https://s3.amazonaws.com/h2o-public-test-data/smalldata/flow_examples/mnist/train.csv.gz - *Testing*: https://s3.amazonaws.com/h2o-public-test-data/smalldata/flow_examples/mnist/test.csv.gz ![Importing Testing Data](../images/DL_importFile_test.png) -0. Click the **Add all** link to add the file to the import queue, then click the **Import** button. +3. Click the **Add all** link to add the file to the import queue, then click the **Import** button. ![Importing Training Data](../images/DL_importFile_train.png) -####Parsing Data +#### Parsing Data Now, parse the imported data: -0. Click the **Parse these files...** button. +1. Click the **Parse these files...** button. >**Note**: The default options typically do not need to be changed unless the data does not parse correctly. -0. From the drop-down **Parser** list, select the file type of the data set (Auto, XLS, CSV, or SVMLight). -0. If the data uses a separator, select it from the drop-down **Separator** list. -0.
If the data uses a column header as the first row, select the **First row contains column names** radio button. If the first row contains data, select the **First row contains data** radio button. You can also select the **Auto** radio button to have H2O automatically determine if the first row of the dataset contains the column names or data. -0. If the data uses apostrophes ( `'` - also known as single quotes), check the **Enable single quotes as a field quotation character** checkbox. -0. Review the data in the **Edit Column Names and Types** section. The last column, `C785`, must be changed to an enum for a classification model. -0. Enter `C785` in the *Search by column name* entry field at the top. -0. Click the drop-down column heading menu for C785 and select `Enum`. +2. From the drop-down **Parser** list, select the file type of the data set (Auto, XLS, CSV, or SVMLight). +3. If the data uses a separator, select it from the drop-down **Separator** list. +4. If the data uses a column header as the first row, select the **First row contains column names** radio button. If the first row contains data, select the **First row contains data** radio button. You can also select the **Auto** radio button to have H2O automatically determine if the first row of the dataset contains the column names or data. +5. If the data uses apostrophes ( `'` - also known as single quotes), check the **Enable single quotes as a field quotation character** checkbox. +6. Review the data in the **Edit Column Names and Types** section. The last column, `C785`, must be changed to an enum for a classification model. +7. Enter `C785` in the *Search by column name* entry field at the top. +8. Click the drop-down column heading menu for C785 and select `Enum`. ![Selecting Enum](../images/DL_SelectEnum.png) -0. Click the **Parse** button. +9. Click the **Parse** button. 
![Parsing Data](../images/DL_Parse.png) >**NOTE**: Make sure the parse is complete by confirming progress is 100% before continuing to the next step, model building. For small datasets, this should only take a few seconds, but larger datasets take longer to parse. -##Building a Model +## Building a Model -0. Once data are parsed, click the **View** button, then click the **Build Model** button. -0. Select `Deep Learning` from the drop-down **Select an algorithm** menu, then click the **Build model** button. -0. If the parsed training data is not already listed in the **Training_frame** drop-down list, select it. +1. Once data are parsed, click the **View** button, then click the **Build Model** button. +2. Select `Deep Learning` from the drop-down **Select an algorithm** menu, then click the **Build model** button. +3. If the parsed training data is not already listed in the **Training_frame** drop-down list, select it. >**Note**: If the **Ignore\_const\_col** checkbox is checked, a list of the excluded columns displays below the **Training_frame** drop-down list. -0. From the drop-down **Validation_frame** list, select the parsed testing (validation) data. -0. From the **Ignored_columns** section, select the columns to ignore in the *Available* area to move them to the *Selected* area. For this example, do not select any columns. -0. From the drop-down **Response** list, select the last column (`C785`). -0. From the drop-down **Activation** list, select the activation function (for this example, select `Tanh`). -0. In the **Hidden** field, specify the hidden layer sizes (for this example, enter `50,50`). -0. In the **Epochs** field, enter the number of times to iterate the dataset (for this example, enter `0.1`). -0. Click the **Build Model** button. +4. From the drop-down **Validation_frame** list, select the parsed testing (validation) data. +5. 
From the **Ignored_columns** section, select the columns to ignore in the *Available* area to move them to the *Selected* area. For this example, do not select any columns. +6. From the drop-down **Response** list, select the last column (`C785`). +7. From the drop-down **Activation** list, select the activation function (for this example, select `Tanh`). +8. In the **Hidden** field, specify the hidden layer sizes (for this example, enter `50,50`). +9. In the **Epochs** field, enter the number of times to iterate the dataset (for this example, enter `0.1`). +10. Click the **Build Model** button. ![Building Models](../images/DL_BuildModel.png) -##Results +## Results To view the results, click the **View** button. The output for the Deep Learning model includes the following information for both the training and testing sets: diff --git a/h2o-docs/src/product/tutorials/gbm/gbm.md b/h2o-docs/src/product/tutorials/gbm/gbm.md index ccaf5515eec..fb6af1ce1f6 100644 --- a/h2o-docs/src/product/tutorials/gbm/gbm.md +++ b/h2o-docs/src/product/tutorials/gbm/gbm.md @@ -16,30 +16,30 @@ Machine Learning repository. They are composed of 452 observations and 279 attri If you don't have any data of your own to work with, you can find some example datasets here: http://data.h2o.ai -###Importing Data +### Importing Data Before creating a model, import data into H2O: -0. Click the **Assist Me!** button (the last button in the row of buttons below the menus). +1. Click the **Assist Me!** button (the last button in the row of buttons below the menus). ![Assist Me button](../images/Flow_AssistMeButton.png) -0. Click the **importFiles** link and enter the file path to the dataset in the **Search** entry field. -0. Click the **Add all** link to add the file to the import queue, then click the **Import** button. +2. Click the **importFiles** link and enter the file path to the dataset in the **Search** entry field. +3. 
Click the **Add all** link to add the file to the import queue, then click the **Import** button. ![Importing Files](../images/GBM_ImportFile.png) -###Parsing Data +### Parsing Data Now, parse the imported data: -0. Click the **Parse these files...** button. +1. Click the **Parse these files...** button. >**Note**: The default options typically do not need to be changed unless the data does not parse correctly. -0. From the drop-down **Parser** list, select the file type of the data set (Auto, XLS, CSV, or SVMLight). -0. If the data uses a separator, select it from the drop-down **Separator** list. -0. If the data uses a column header as the first row, select the **First row contains column names** radio button. If the first row contains data, select the **First row contains data** radio button. You can also select the **Auto** radio button to have H2O automatically determine if the first row of the dataset contains the column names or data. -0. If the data uses apostrophes ( `'` - also known as single quotes), check the **Enable single quotes as a field quotation character** checkbox. -0. Review the data in the **Data Preview** section, then click the **Parse** button. +2. From the drop-down **Parser** list, select the file type of the data set (Auto, XLS, CSV, or SVMLight). +3. If the data uses a separator, select it from the drop-down **Separator** list. +4. If the data uses a column header as the first row, select the **First row contains column names** radio button. If the first row contains data, select the **First row contains data** radio button. You can also select the **Auto** radio button to have H2O automatically determine if the first row of the dataset contains the column names or data. +5. If the data uses apostrophes ( `'` - also known as single quotes), check the **Enable single quotes as a field quotation character** checkbox. +6. Review the data in the **Data Preview** section, then click the **Parse** button. 
![Parsing Data](../images/GBM_Parse.png) @@ -50,17 +50,17 @@ Now, parse the imported data: ### Building a Model -0. Once data are parsed, click the **View** button, then click the **Build Model** button. -0. Select `Gradient Boosting Machine` from the drop-down **Select an algorithm** menu, then click the **Build model** button. -0. If the parsed arrhythmia.hex file is not already listed in the **Training_frame** drop-down list, select it. Otherwise, continue to the next step. -0. From the **Ignored_columns** section, select the columns to ignore in the *Available* area to move them to the *Selected* area. For this example, do not select any columns. -0. From the drop-down **Response** list, select column 1 (`C1`). -0. In the **Ntrees** field, specify the number of trees to build (for this example, `20`). -0. In the **Max_depth** field, specify the maximum number of edges between the top node and the furthest node as a stopping criteria (for this example, use the default value of `5`). -0. In the **Min_rows** field, specify the minimum number of observations (rows) to include in any terminal node as a stopping criteria (for this example, `25`). -0. In the **Nbins** field, specify the number of bins to use for data splitting (for this example, use the default value of `20`). The split points are evaluated at the boundaries at each of these bins. As the value of **Nbins** increases, the algorithm approximates more closely the evaluation of each individual observation as a split point. The cost of this refinement is an increase in computational time. -0. In the **Learn_rate** field, specify the tuning parameter (also known as shrinkage) to slow the convergence of the algorithm to a solution, which helps prevent overfitting. For this example, enter `0.3`. -0. Click the **Build Model** button. +1. Once data are parsed, click the **View** button, then click the **Build Model** button. +2. 
Select `Gradient Boosting Machine` from the drop-down **Select an algorithm** menu, then click the **Build model** button.
+3. If the parsed arrhythmia.hex file is not already listed in the **Training_frame** drop-down list, select it. Otherwise, continue to the next step.
+4. From the **Ignored_columns** section, select the columns to ignore in the *Available* area to move them to the *Selected* area. For this example, do not select any columns.
+5. From the drop-down **Response** list, select column 1 (`C1`).
+6. In the **Ntrees** field, specify the number of trees to build (for this example, `20`).
+7. In the **Max_depth** field, specify the maximum number of edges between the top node and the furthest node as a stopping criterion (for this example, use the default value of `5`).
+8. In the **Min_rows** field, specify the minimum number of observations (rows) to include in any terminal node as a stopping criterion (for this example, `25`).
+9. In the **Nbins** field, specify the number of bins to use for data splitting (for this example, use the default value of `20`). The split points are evaluated at the boundaries of each of these bins. As the value of **Nbins** increases, the algorithm more closely approximates the evaluation of each individual observation as a split point. The cost of this refinement is an increase in computational time.
+10. In the **Learn_rate** field, specify the tuning parameter (also known as shrinkage) to slow the convergence of the algorithm to a solution, which helps prevent overfitting. For this example, enter `0.3`.
+11. Click the **Build Model** button.

 ![Building Models](../images/GBM_BuildModel.png)

@@ -86,7 +86,7 @@ The output for GBM includes the following:

For classification models, the MSE is based on the classification error within the tree. For regression models, MSE is calculated from the squared deviances, as it is in standard regressions.
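The role of **Learn_rate** (shrinkage) described in the build steps can be seen in a toy sketch. This is an assumed illustration, not H2O's GBM: the base learner here is just a constant fit to the mean residual, so each boosting stage removes a `learn_rate` fraction of the remaining average error.

```python
# Toy boosting sketch (not H2O's implementation): each "tree" is a constant
# equal to the mean residual, scaled by the shrinkage parameter.
def mean_residual(y, n_trees, learn_rate):
    pred = [0.0] * len(y)
    for _ in range(n_trees):
        step = sum(t - p for t, p in zip(y, pred)) / len(y)  # mean residual
        pred = [p + learn_rate * step for p in pred]
    return sum(t - p for t, p in zip(y, pred)) / len(y)

y = [2.0, 4.0, 6.0]                                   # mean target = 4.0
fast = mean_residual(y, n_trees=20, learn_rate=0.3)   # rate from this example
slow = mean_residual(y, n_trees=20, learn_rate=0.05)  # heavier shrinkage

# Each stage multiplies the mean residual by (1 - learn_rate), so after 20
# trees the fast run leaves roughly 4 * 0.7**20 of the original mean error,
# the slow run roughly 4 * 0.95**20, which is far larger.
```

A lower learn rate therefore trades more trees (and compute) for smoother convergence, which is the overfitting guard the **Learn_rate** step describes.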
-###Viewing Predictions +### Viewing Predictions To view predictions, click the **Predict** button. From the drop-down **Frame** list, select the arrhythmia.hex file and click the **Predict** button. diff --git a/h2o-docs/src/product/tutorials/glm/glm.md b/h2o-docs/src/product/tutorials/glm/glm.md index 44c1f86ab67..4602db341bc 100644 --- a/h2o-docs/src/product/tutorials/glm/glm.md +++ b/h2o-docs/src/product/tutorials/glm/glm.md @@ -6,7 +6,7 @@ Those who have never used H2O before should refer to http://data.h2o.ai. -####Importing Data +#### Importing Data Before creating a model, import data into H2O: -0. Click the **Assist Me!** button (the last button in the row of buttons below the menus). +1. Click the **Assist Me!** button (the last button in the row of buttons below the menus). ![Assist Me button](../images/Flow_AssistMeButton.png) -0. Click the **importFiles** link and enter the file path to the dataset in the **Search** entry field. -0. Click the **Add all** link to add the file to the import queue, then click the **Import** button. +2. Click the **importFiles** link and enter the file path to the dataset in the **Search** entry field. +3. Click the **Add all** link to add the file to the import queue, then click the **Import** button. ![Importing Files](../images/GLM_ImportFile.png) -####Parsing Data +#### Parsing Data Now, parse the imported data: -0. Click the **Parse these files...** button. +1. Click the **Parse these files...** button. >**Note**: The default options typically do not need to be changed unless the data does not parse correctly. -0. From the drop-down **Parser** list, select the file type of the data set (Auto, XLS, CSV, or SVMLight). -0. If the data uses a separator, select it from the drop-down **Separator** list. -0. If the data uses a column header as the first row, select the **First row contains column names** radio button. If the first row contains data, select the **First row contains data** radio button. 
To have H2O automatically determine if the first row of the dataset contains column names or data, select the **Auto** radio button. -0. If the data uses apostrophes ( `'` - also known as single quotes), check the **Enable single quotes as a field quotation character** checkbox. -0. Review the data in the **Edit Column Names and Types** section, then click the **Parse** button. +2. From the drop-down **Parser** list, select the file type of the data set (Auto, XLS, CSV, or SVMLight). +3. If the data uses a separator, select it from the drop-down **Separator** list. +4. If the data uses a column header as the first row, select the **First row contains column names** radio button. If the first row contains data, select the **First row contains data** radio button. To have H2O automatically determine if the first row of the dataset contains column names or data, select the **Auto** radio button. +5. If the data uses apostrophes ( `'` - also known as single quotes), check the **Enable single quotes as a field quotation character** checkbox. +6. Review the data in the **Edit Column Names and Types** section, then click the **Parse** button. ![Parsing Data](../images/GLM_Parse.png) @@ -58,16 +58,16 @@ Now, parse the imported data: ### Building a Model -0. Once data are parsed, click the **View** button, then click the **Build Model** button. -0. Select `Generalized Linear Model` from the drop-down **Select an algorithm** menu, then click the **Build model** button. -0. If the parsed Abalone .hex file is not already listed in the **Training_frame** drop-down list, select it. Otherwise, continue to the next step. -0. To generate a scoring history, select the abalone.hex file from the **Validation_frame** drop-down list. -0. In the **Response** field, select the column associated with the Whole Weight variable (`C9`). -0. From the drop-down **Family** menu, select `gaussian`. -0. Enter `0.3` in the **Alpha** field. 
The alpha parameter is the mixing parameter for the L1 and L2 penalty. -0. Enter `.002` in the **Lambda** field. -0. Check the **Lambda_search** checkbox. -0. Click the **Build Model** button. +1. Once data are parsed, click the **View** button, then click the **Build Model** button. +2. Select `Generalized Linear Model` from the drop-down **Select an algorithm** menu, then click the **Build model** button. +3. If the parsed Abalone .hex file is not already listed in the **Training_frame** drop-down list, select it. Otherwise, continue to the next step. +4. To generate a scoring history, select the abalone.hex file from the **Validation_frame** drop-down list. +5. In the **Response** field, select the column associated with the Whole Weight variable (`C9`). +6. From the drop-down **Family** menu, select `gaussian`. +7. Enter `0.3` in the **Alpha** field. The alpha parameter is the mixing parameter for the L1 and L2 penalty. +8. Enter `.002` in the **Lambda** field. +9. Check the **Lambda_search** checkbox. +10. Click the **Build Model** button. ![Building Models](../images/GLM_BuildModel.png) diff --git a/h2o-docs/src/product/tutorials/glossary.md b/h2o-docs/src/product/tutorials/glossary.md index 7f5a05868d7..22a3785e903 100644 --- a/h2o-docs/src/product/tutorials/glossary.md +++ b/h2o-docs/src/product/tutorials/glossary.md @@ -1,4 +1,4 @@ -#Glossary +# Glossary Term | Definition| ------------ | ------------- | diff --git a/h2o-docs/src/product/tutorials/kmeans/kmeans.md b/h2o-docs/src/product/tutorials/kmeans/kmeans.md index e1739d54bc6..7b8b4f267c4 100644 --- a/h2o-docs/src/product/tutorials/kmeans/kmeans.md +++ b/h2o-docs/src/product/tutorials/kmeans/kmeans.md @@ -19,30 +19,30 @@ The data are composed of 210 observations, 7 attributes, and an a priori groupin If you don't have any data of your own to work with, you can find some example datasets at http://data.h2o.ai. -####Importing Data +#### Importing Data Before creating a model, import data into H2O: -0. 
Click the **Assist Me!** button (the last button in the row of buttons below the menus). +1. Click the **Assist Me!** button (the last button in the row of buttons below the menus). ![Assist Me button](../images/Flow_AssistMeButton.png) -0. Click the **importFiles** link and enter the file path to the dataset in the **Search** entry field. -0. Click the **Add all** link to add the file to the import queue, then click the **Import** button. +2. Click the **importFiles** link and enter the file path to the dataset in the **Search** entry field. +3. Click the **Add all** link to add the file to the import queue, then click the **Import** button. ![Importing Files](../images/KM_ImportFile.png) -####Parsing Data +#### Parsing Data Now, parse the imported data: -0. Click the **Parse these files...** button. +1. Click the **Parse these files...** button. >**Note**: The default options typically do not need to be changed unless the data does not parse correctly. -0. From the drop-down **Parser** list, select the file type of the data set (Auto, XLS, CSV, or SVMLight). -0. If the data uses a separator, select it from the drop-down **Separator** list. -0. If the data uses a column header as the first row, select the **First row contains column names** radio button. If the first row contains data, select the **First row contains data** radio button. You can also select the **Auto** radio button to have H2O automatically determine if the first row of the dataset contains the column names or data. -0. If the data uses apostrophes ( `'` - also known as single quotes), check the **Enable single quotes as a field quotation character** checkbox. -0. Review the data in the **Edit Column Names and Types** section, then click the **Parse** button. +2. From the drop-down **Parser** list, select the file type of the data set (Auto, XLS, CSV, or SVMLight). +3. If the data uses a separator, select it from the drop-down **Separator** list. +4. 
If the data uses a column header as the first row, select the **First row contains column names** radio button. If the first row contains data, select the **First row contains data** radio button. You can also select the **Auto** radio button to have H2O automatically determine if the first row of the dataset contains the column names or data. +5. If the data uses apostrophes ( `'` - also known as single quotes), check the **Enable single quotes as a field quotation character** checkbox. +6. Review the data in the **Edit Column Names and Types** section, then click the **Parse** button. ![Parsing Data](../images/KM_Parse.png) @@ -52,13 +52,13 @@ Now, parse the imported data: ### Building a Model -0. Once data are parsed, click the **View** button, then click the **Build Model** button. -0. Select `K-means` from the drop-down **Select an algorithm** menu, then click the **Build model** button. -0. If the parsed seeds_dataset.hex file is not already listed in the **Training_frame** drop-down list, select it. Otherwise, continue to the next step. -0. From the **Ignored_columns** section, select the columns to ignore in the *Available* area to move them to the *Selected* area. For this example, select column 8 (the a priori known clusters for this dataset). -0. In the **K** field, specify the number of clusters. For this example, enter `3`. -0. In the **Max_iterations** field, specify the maximum number of iterations. For this example, enter `100`. -0. From the drop-down **Init** menu, select the initialization mode. For this example, select **PlusPlus**. +1. Once data are parsed, click the **View** button, then click the **Build Model** button. +2. Select `K-means` from the drop-down **Select an algorithm** menu, then click the **Build model** button. +3. If the parsed seeds_dataset.hex file is not already listed in the **Training_frame** drop-down list, select it. Otherwise, continue to the next step. +4. 
From the **Ignored_columns** section, select the columns to ignore in the *Available* area to move them to the *Selected* area. For this example, select column 8 (the a priori known clusters for this dataset). +5. In the **K** field, specify the number of clusters. For this example, enter `3`. +6. In the **Max_iterations** field, specify the maximum number of iterations. For this example, enter `100`. +7. From the drop-down **Init** menu, select the initialization mode. For this example, select **PlusPlus**. - Random initialization randomly samples the `k`-specified value of the rows of the training data as cluster centers. - PlusPlus initialization chooses one initial center at random and weights the random selection of subsequent centers so that points furthest from the first center are more likely to be chosen. - Furthest initialization chooses one initial center at random and then chooses the next center to be the point furthest away in terms of Euclidean distance. @@ -66,8 +66,8 @@ Now, parse the imported data: **Note**: The user-specified points dataset must have the same number of columns as the training dataset. -0. Uncheck the **Standardize** checkbox to disable column standardization. -0. Click the **Build Model** button. +8. Uncheck the **Standardize** checkbox to disable column standardization. +9. Click the **Build Model** button. ![K-Means Model Builder cell](../images/Kmeans_BuildModel.png) diff --git a/h2o-docs/src/product/tutorials/pca/pca.md b/h2o-docs/src/product/tutorials/pca/pca.md index 4e2b2c24245..67de1949e26 100644 --- a/h2o-docs/src/product/tutorials/pca/pca.md +++ b/h2o-docs/src/product/tutorials/pca/pca.md @@ -21,28 +21,28 @@ Machine Learning Repository. They are composed of 452 observations and If you don't have any data of your own to work with, you can find some example datasets at http://data.h2o.ai. -####Importing Data +#### Importing Data Before creating a model, import data into H2O: -0. 
Click the **Assist Me!** button in the *Help* tab in the sidebar on the right side of the page. +1. Click the **Assist Me!** button in the *Help* tab in the sidebar on the right side of the page. ![Assist Me button](../images/AssistButton.png) -0. Click the **importFiles** link and enter the file path to the dataset in the **Search** entry field. -0. Click the **Add all** link to add the file to the import queue, then click the **Import** button. +2. Click the **importFiles** link and enter the file path to the dataset in the **Search** entry field. +3. Click the **Add all** link to add the file to the import queue, then click the **Import** button. ![Importing Files](../images/GBM_ImportFile.png) -####Parsing Data +#### Parsing Data Now, parse the imported data: -0. Click the **Parse these files...** button. +1. Click the **Parse these files...** button. >**Note**: The default options typically do not need to be changed unless the data does not parse correctly. -0. From the drop-down **Parser** list, select the file type of the data set (Auto, XLS, CSV, or SVMLight). -0. If the data uses a separator, select it from the drop-down **Separator** list. -0. If the data uses a column header as the first row, select the **First row contains column names** radio button. If the first row contains data, select the **First row contains data** radio button. You can also select the **Auto** radio button to have H2O automatically determine if the first row of the dataset contains the column names or data. -0. If the data uses apostrophes ( `'` - also known as single quotes), check the **Enable single quotes as a field quotation character** checkbox. -0. Review the data in the **Edit Column Names and Types** section, then click the **Parse** button. +2. From the drop-down **Parser** list, select the file type of the data set (Auto, XLS, CSV, or SVMLight). +3. If the data uses a separator, select it from the drop-down **Separator** list. +4. 
If the data uses a column header as the first row, select the **First row contains column names** radio button. If the first row contains data, select the **First row contains data** radio button. You can also select the **Auto** radio button to have H2O automatically determine if the first row of the dataset contains the column names or data. +5. If the data uses apostrophes ( `'` - also known as single quotes), check the **Enable single quotes as a field quotation character** checkbox. +6. Review the data in the **Edit Column Names and Types** section, then click the **Parse** button. ![Parsing Data](../images/GBM_Parse.png) @@ -51,13 +51,13 @@ Now, parse the imported data: ### Building a Model -0. Once data are parsed, click the **View** button, then click the **Build Model** button. -0. Select `Principal Component Analysis` from the drop-down **Select an algorithm** menu, then click the **Build model** button. -0. If the parsed arrhythmia.hex file is not already listed in the **Training_frame** drop-down list, select it. Otherwise, continue to the next step. -0. From the drop-down **pca_method** menu, select the method for computing PCA. For this example, select *GramSVD*. The *GramSVD* option forms the Gram matrix of the training frame via a distributed computation, then computes the singular value decomposition (SVD) of the Gram locally using the JAMA package. The principal component vectors and standard deviations are recovered from the SVD. -0. In the **K** field, specify the number of clusters. For this example, enter `3`. -0. In the **Max_iterations** field, specify the maximum number of iterations. For this example, enter `100`. -0. Click the **Build Model** button. +1. Once data are parsed, click the **View** button, then click the **Build Model** button. +2. Select `Principal Component Analysis` from the drop-down **Select an algorithm** menu, then click the **Build model** button. +3. 
If the parsed arrhythmia.hex file is not already listed in the **Training_frame** drop-down list, select it. Otherwise, continue to the next step. +4. From the drop-down **pca_method** menu, select the method for computing PCA. For this example, select *GramSVD*. The *GramSVD* option forms the Gram matrix of the training frame via a distributed computation, then computes the singular value decomposition (SVD) of the Gram locally using the JAMA package. The principal component vectors and standard deviations are recovered from the SVD. +5. In the **K** field, specify the number of clusters. For this example, enter `3`. +6. In the **Max_iterations** field, specify the maximum number of iterations. For this example, enter `100`. +7. Click the **Build Model** button. ![Building PCA Models](../images/PCA_BuildModel.png) diff --git a/h2o-docs/src/product/tutorials/rf/rf.md b/h2o-docs/src/product/tutorials/rf/rf.md index 1d1abdbf3c5..94ca5922351 100644 --- a/h2o-docs/src/product/tutorials/rf/rf.md +++ b/h2o-docs/src/product/tutorials/rf/rf.md @@ -14,41 +14,42 @@ The data are composed of 3279 observations, 1557 attributes, and an a priori gro If you don't have any data of your own to work with, you can find some example datasets at http://data.h2o.ai. -####Importing Data +#### Importing Data Before creating a model, import data into H2O: -0. Click the **Assist Me!** button (the last button in the row of buttons below the menus). +1. Click the **Assist Me!** button (the last button in the row of buttons below the menus). ![Assist Me button](../images/Flow_AssistMeButton.png) -0. Click the **importFiles** link and enter the file path to the dataset in the **Search** entry field. -0. Click the **Add all** link to add the file to the import queue, then click the **Import** button. +2. Click the **importFiles** link and enter the file path to the dataset in the **Search** entry field. +3. 
Click the **Add all** link to add the file to the import queue, then click the **Import** button. ![Importing Files](../images/RF_ImportFile.png) -####Parsing Data +#### Parsing Data + Now, parse the imported data: -0. Click the **Parse these files...** button. +1. Click the **Parse these files...** button. **Note**: The default options typically do not need to be changed unless the data does not parse correctly. -0. From the drop-down **Parser** list, select the file type of the data set (Auto, XLS, CSV, or SVMLight). -0. If the data uses a separator, select it from the drop-down **Separator** list. -0. If the data uses a column header as the first row, select the **First row contains column names** radio button. If the first row contains data, select the **First row contains data** radio button. To have H2O automatically determine if the first row of the dataset contains column names or data, select the **Auto** radio button. -0. If the data uses apostrophes ( `'` - also known as single quotes), check the **Enable single quotes as a field quotation character** checkbox. -0. To delete the imported dataset after parsing, check the **Delete on done** checkbox. +2. From the drop-down **Parser** list, select the file type of the data set (Auto, XLS, CSV, or SVMLight). +3. If the data uses a separator, select it from the drop-down **Separator** list. +4. If the data uses a column header as the first row, select the **First row contains column names** radio button. If the first row contains data, select the **First row contains data** radio button. To have H2O automatically determine if the first row of the dataset contains column names or data, select the **Auto** radio button. +5. If the data uses apostrophes ( `'` - also known as single quotes), check the **Enable single quotes as a field quotation character** checkbox. +6. To delete the imported dataset after parsing, check the **Delete on done** checkbox. **NOTE**: In general, we recommend enabling this option. 
Retaining data requires memory resources, but does not aid in modeling because unparsed data cannot be used by H2O. -0. Review the data in the **Edit Column Names and Types** section. -0. Click the **Next page** button until you reach the last page. +7. Review the data in the **Edit Column Names and Types** section. +8. Click the **Next page** button until you reach the last page. ![Page buttons](../images/Flow_PageButtons.png) -0. For column 1559, select `Enum` from the drop-down column type menu. -0. Click the **Parse** button. +9. For column 1559, select `Enum` from the drop-down column type menu. +10. Click the **Parse** button. ![Parsing Data](../images/RF_Parse.png) @@ -58,14 +59,14 @@ Now, parse the imported data: ### Building a Model -0. Once data are parsed, click the **View** button, then click the **Build Model** button. -0. Select `Distributed RF` from the drop-down **Select an algorithm** menu, then click the **Build model** button. -0. If the parsed ad.hex file is not already listed in the **Training_frame** drop-down list, select it. Otherwise, continue to the next step. -0. From the **Response column** drop-down list, select `C1`. -0. In the **Ntrees** field, specify the number of trees for the model to build. For this example, enter `150`. -0. In the **Max_depth** field, specify the maximum distance from the root to the terminal node. For this example, use the default value of `20`. -0. In the **Mtries** field, specify the number of features on which the trees will be split. For this example, enter `1000`. -0. Click the **Build Model** button. +1. Once data are parsed, click the **View** button, then click the **Build Model** button. +2. Select `Distributed RF` from the drop-down **Select an algorithm** menu, then click the **Build model** button. +3. If the parsed ad.hex file is not already listed in the **Training_frame** drop-down list, select it. Otherwise, continue to the next step. +4. From the **Response column** drop-down list, select `C1`. 
+5. In the **Ntrees** field, specify the number of trees for the model to build. For this example, enter `150`. +6. In the **Max_depth** field, specify the maximum distance from the root to the terminal node. For this example, use the default value of `20`. +7. In the **Mtries** field, specify the number of features on which the trees will be split. For this example, enter `1000`. +8. Click the **Build Model** button. ![Random Forest Model Builder](../images/RF_BuildModel.png) diff --git a/h2o-docs/src/product/upgrade/H2OBenefits.md b/h2o-docs/src/product/upgrade/H2OBenefits.md index c605758ac37..97893cbc6eb 100644 --- a/h2o-docs/src/product/upgrade/H2OBenefits.md +++ b/h2o-docs/src/product/upgrade/H2OBenefits.md @@ -10,30 +10,30 @@ The main benefits to using H2O are: - **Fast performance**: Create models in minutes using H2O's unique in-memory capabilities -##What H2O Provides +## What H2O Provides -###Better Predictions +### Better Predictions - Powerful, ready-to-use algorithms that derive insights from all your data -###Speed +### Speed - In-memory parallel processing for real-time responsiveness, increasing efficiency, and running models without sampling -###Ease of Use +### Ease of Use - Flow, an intuitive web UI that is designed to simplify a data scientist's workflow, allows you to modify, save, export, and share your workflow with others -###Extensibility +### Extensibility - Seamless Hadoop integration with distributed data ingestion from HDFS and S3 - Models are built using Java and can be exported as Plain Old Java Objects (POJO) for integration in your custom application -###Scalability +### Scalability - Easy to iterate, develop, and train models on large data without extra modeling time -###Real-time Scoring +### Real-time Scoring - Predict and score more accurately and 10x faster than the next best technology on the market @@ -48,6 +48,9 @@ K-Means | A method to uncover groups or clusters of data points often used for s Anomaly Detection | Identify 
the outliers in your data by invoking a powerful pattern recognition model. Deep Learning | Model high-level abstractions in data by using non-linear transformations in a layer-by-layer method. Deep learning is an example of unsupervised learning and can make use of unlabeled data that other algorithms cannot. Naïve Bayes | A probabilistic classifier that assumes the value of a particular feature is unrelated to the presence or absence of any other feature, given the class variable. It is often used in text categorization. +Stacked Ensembles | A supervised ensemble machine learning algorithm that finds the optimal combination of a collection of prediction algorithms using a process called stacking. +XGBoost | An optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. This algorithm provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way. +Word2vec | An algorithm that takes a text corpus as an input and produces the word vectors as output. The algorithm first creates a vocabulary from the training text data and then learns vector representations of the words. ### Scoring Models with Confidence Score Tool | Description @@ -60,7 +63,7 @@ AUC | A graphical plot to visualize the performance of a model by its sensitivit --- -###Use Cases +### Use Cases - Fraud detection - Churn identification to prevent turnover @@ -71,7 +74,7 @@ AUC | A graphical plot to visualize the performance of a model by its sensitivit - Evaluation of ad campaign effectiveness - Customer classification to predict purchase behavior or renewal rates -###Customer Examples +### Customer Examples - Cisco saw a 15x increase in speed after implementing H2O into their Propensity to Buy (P2B) modeling factory. - Paypal uses H2O's Deep Learning algorithm for fraud detection and prevention. 
@@ -79,26 +82,32 @@ AUC | A graphical plot to visualize the performance of a model by its sensitivit
- MarketShare uses H2O for marketing optimization to improve efficiency in cross-channel attribution and forecasting.


-##Required Resources
+## Required Resources

-###Hardware and Software
+### Hardware and Software

- Java is required to run H2O. GNU compiler for Java and Open JDK are not supported.
- The amount of memory required depends on the size of your data. We recommend having four times as much memory as your largest dataset.
- To view a one-page document that outlines the system configurations we recommend, click [here](http://h2o.ai/product/recommended-systems-for-h2o/).

-###Data
+### Data

H2O works with tabular data, which can be imported as a single file or as a directory of files. The following formats are supported:

- - ARFF
- - XLS/XLSX
- - CSV
- - SVMLight
+- CSV (delimited) files
+- ORC
+- SVMLight
+- ARFF
+- XLS
+- XLSX
+- Avro (without multifile parsing or column type modification)
+- Parquet
+
+>Note that ORC is available only if H2O is running as a Hadoop job.

The data does not need to be perfect, as some munging can be performed within H2O (such as excluding columns with a specified percentage of missing values). However, the more precise your dataset is, the more accurate your models will be.

-###Support Team
+### Support Team

H2O is designed to be easy to both set up and use, but we recommend assembling a team that includes:
diff --git a/h2o-docs/src/product/upgrade/H2ODevPortingRScripts.md b/h2o-docs/src/product/upgrade/H2ODevPortingRScripts.md
index 88edc5e201b..47383fd7e41 100644
--- a/h2o-docs/src/product/upgrade/H2ODevPortingRScripts.md
+++ b/h2o-docs/src/product/upgrade/H2ODevPortingRScripts.md
@@ -1,4 +1,4 @@
-#Porting R Scripts
+# Porting R Scripts

This document outlines how to port R scripts written in previous versions of H2O (Nunes 2.8.6.2 or prior, also known as "H2O Classic") for compatibility with the new H2O 3.0 API.
When upgrading from H2O to H2O 3.0, most functions are the same. However, there are some differences that will need to be resolved when porting any scripts that were originally created using H2O to H2O 3.0. @@ -10,9 +10,9 @@ For additional assistance within R, enter a question mark before the command (fo There is also a "shim" available that will review R scripts created with previous versions of H2O, identify deprecated or renamed parameters, and suggest replacements. For more information, refer to the repo [here](https://github.com/h2oai/h2o-dev/blob/d9693a97da939a2b77c24507c8b40a5992192489/h2o-r/h2o-package/R/shim.R). -##Changes from H2O 2.8 to H2O 3.0 +## Changes from H2O 2.8 to H2O 3.0 -###`h2o.exec` +### `h2o.exec` The `h2o.exec` command is no longer supported. Any workflows using `h2o.exec` must be revised to remove this command. If the H2O 3.0 workflow contains any parameters or commands from H2O Classic, errors will result and the workflow will fail. The purpose of `h2o.exec` was to wrap expressions so that they could be evaluated in a single `\Exec2` call. For example, @@ -34,23 +34,23 @@ A String array is ["f00", "b4r"], *not* "[\"f00\", \"b4r\"]" Only string values are enclosed in double quotation marks (`"`). -###`h2o.performance` +### `h2o.performance` To access any exclusively binomial output, use `h2o.performance`, optionally with the corresponding accessor. The accessor can only use the model metrics object created by `h2o.performance`. Each accessor is named for its corresponding field (for example, `h2o.AUC`, `h2o.gini`, `h2o.F1`). `h2o.performance` supports all current algorithms except for K-Means. If you specify a data frame as a second parameter, H2O will use the specified data frame for scoring. If you do not specify a second parameter, the training metrics for the model metrics object are used. -###`xval` and `validation` slots +### `xval` and `validation` slots The `xval` slot has been removed, as `nfolds` is not currently supported. 
The `validation` slot has been merged with the `model` slot. -###Principal Components Regression (PCR) +### Principal Components Regression (PCR) Principal Components Regression (PCR) has also been deprecated. To obtain PCR values, create a Principal Components Analysis (PCA) model, then create a GLM model from the scored data from the PCA model. -###Saving and Loading Models +### Saving and Loading Models Saving and loading a model from R is supported in version 3.0.0.18 and later. H2O 3.0 uses the same binary serialization method as previous versions of H2O, but saves the model and its dependencies into a directory, with each object as a separate file. The `save_CV` option available in previous versions of H2O has been deprecated, as `h2o.saveAll` and `h2o.loadAll` are not currently supported. The following commands are now supported: @@ -70,11 +70,11 @@ Saving and loading a model from R is supported in version 3.0.0.18 and later. H2 -##GBM +## GBM N-fold cross-validation and grid search will be supported in a future version of H2O 3.0. -###Renamed GBM Parameters +### Renamed GBM Parameters The following parameters have been renamed, but retain the same functions: @@ -92,7 +92,7 @@ H2O Classic Parameter Name | H2O 3.0 Parameter Name `max.after.balance.size` | `max_after_balance_size` -###Deprecated GBM Parameters +### Deprecated GBM Parameters The following parameters have been removed: @@ -101,7 +101,7 @@ The following parameters have been removed: - `holdout.fraction`: The fraction of the training data to hold out for validation is no longer supported. - `grid.parallelism`: Specifying the number of parallel threads to run during a grid search is no longer supported. Grid search will be supported in a future version of H2O 3.0.
-###New GBM Parameters +### New GBM Parameters The following parameters have been added: @@ -109,7 +109,7 @@ The following parameters have been added: - `score_each_iteration`: Display error rate information after each tree in the requested set is built. - `build_tree_one_node`: Run on a single node to use fewer CPUs. -###GBM Algorithm Comparison +### GBM Algorithm Comparison H2O Classic | H2O 3.0 ------------- | ------------- @@ -151,7 +151,7 @@ H2O Classic | H2O 3.0 `class.sampling.factors = NULL,` | `grid.parallelism = 1)` | -###Output +### Output The following table provides the component name in H2O, the corresponding component name in H2O 3.0 (if supported), and the model type (binomial, multinomial, or all). Many components are now included in `h2o.performance`; for more information, refer to [(`h2o.performance`)](#h2operf). @@ -182,11 +182,11 @@ H2O Classic | H2O 3.0 | Model Type --- -##GLM +## GLM N-fold cross-validation and grid search will be supported in a future version of H2O 3.0. -###Renamed GLM Parameters +### Renamed GLM Parameters The following parameters have been renamed, but retain the same functions: @@ -199,7 +199,7 @@ H2O Classic Parameter Name | H2O 3.0 Parameter Name `iter.max` | `max_iterations` `epsilon` | `beta_epsilon` -###Deprecated GLM Parameters +### Deprecated GLM Parameters The following parameters have been removed: @@ -212,14 +212,14 @@ The following parameters have been removed: - `offset`: Specify a column as an offset. (may be re-added) - `max_predictors`: Stops training the algorithm if the number of predictors exceeds the specified value. (may be re-added) -###New GLM Parameters +### New GLM Parameters The following parameters have been added: - `validation_frame`: Specify the validation dataset. - `solver`: Select IRLSM or LBFGS. 
-###GLM Algorithm Comparison +### GLM Algorithm Comparison H2O Classic | H2O 3.0 @@ -258,7 +258,7 @@ H2O Classic | H2O 3.0 `max_predictors = -1)` | -###Output +### Output The following table provides the component name in H2O, the corresponding component name in H2O 3.0 (if supported), and the model type (binomial, multinomial, or all). Many components are now included in `h2o.performance`; for more information, refer to [(`h2o.performance`)](#h2operf). @@ -284,9 +284,9 @@ H2O Classic | H2O 3.0 | Model Type `@model$confusion` |   | `binomial` -##K-Means +## K-Means -###Renamed K-Means Parameters +### Renamed K-Means Parameters The following parameters have been renamed, but retain the same functions: @@ -301,14 +301,14 @@ H2O Classic Parameter Name | H2O 3.0 Parameter Name **Note** In H2O, the `normalize` parameter was disabled by default. The `standardize` parameter is enabled by default in H2O 3.0 to provide more accurate results for datasets containing columns with large values. -###New K-Means Parameters +### New K-Means Parameters The following parameters have been added: - `user` has been added as an additional option for the `init` parameter. Using this parameter forces the K-Means algorithm to start at the user-specified points. - `user_points`: Specify starting points for the K-Means algorithm. -###K-Means Algorithm Comparison +### K-Means Algorithm Comparison H2O Classic | H2O 3.0 ------------- | ------------- @@ -322,7 +322,7 @@ H2O Classic | H2O 3.0 `init = "none",` | `init = c("Furthest","Random", "PlusPlus"),` `seed = 0,` | `seed)` -###Output +### Output The following table provides the component name in H2O and the corresponding component name in H2O 3.0 (if supported). @@ -340,13 +340,13 @@ H2O Classic | H2O 3.0 --- -##Deep Learning +## Deep Learning N-fold cross-validation and grid search will be supported in a future version of H2O 3.0. 
**Note**: If the results in the confusion matrix are incorrect, verify that `score_training_samples` is equal to 0. By default, only the first 10,000 rows are included. -###Renamed Deep Learning Parameters +### Renamed Deep Learning Parameters The following parameters have been renamed, but retain the same functions: @@ -360,7 +360,7 @@ H2O Classic Parameter Name | H2O 3.0 Parameter Name `dlmodel@model$valid_class_error` | `@model$validation_metrics@$MSE` -###Deprecated DL Parameters +### Deprecated DL Parameters The following parameters have been removed: @@ -368,7 +368,7 @@ The following parameters have been removed: - `holdout_fraction`: Fraction of the training data to hold out for validation. - `dlmodel@model$best_cutoff`: This output parameter has been removed. -###New DL Parameters +### New DL Parameters The following parameters have been added: @@ -379,7 +379,7 @@ The following options for the `loss` parameter have been added: - `absolute`: Provides strong penalties for mispredictions - `huber`: Can improve results for regression -###DL Algorithm Comparison +### DL Algorithm Comparison H2O Classic | H2O 3.0 ------------- | ------------- @@ -460,7 +460,7 @@ H2O Classic | H2O 3.0   | `keep_cross_validation_predictions = FALSE)` -###Output +### Output The following table provides the component name in H2O, the corresponding component name in H2O 3.0 (if supported), and the model type (binomial, multinomial, or all). Many components are now included in `h2o.performance`; for more information, refer to [(`h2o.performance`)](#h2operf). @@ -482,9 +482,9 @@ H2O Classic | H2O 3.0 | Model Type --- -##Distributed Random Forest +## Distributed Random Forest -###Changes to DRF in H2O 3.0 +### Changes to DRF in H2O 3.0 Distributed Random Forest (DRF) was represented as `h2o.randomForest(type="BigData", ...)` in H2O Classic. In H2O Classic, SpeeDRF (`type="fast"`) was not as accurate, especially for complex data with categoricals, and did not address regression problems. 
DRF (`type="BigData"`) was at least as accurate as SpeeDRF (`type="fast"`) and was the only algorithm that scaled to big data (data too large to fit on a single node). In H2O 3.0, our plan is to improve the performance of DRF so that the data fits on a single node (optimally, for all cases), which will make SpeeDRF obsolete. Ultimately, the goal is to provide a single algorithm that provides the "best of both worlds" for all datasets and use cases. @@ -492,7 +492,7 @@ Please note that H2O does not currently support the ability to specify the numbe **Note**: H2O 3.0 only supports DRF. SpeeDRF is no longer supported. The functionality of DRF in H2O 3.0 is similar to DRF functionality in H2O. -###Renamed DRF Parameters +### Renamed DRF Parameters The following parameters have been renamed, but retain the same functions: @@ -510,7 +510,7 @@ H2O Classic Parameter Name | H2O 3.0 Parameter Name `nodesize` | `min_rows` -###Deprecated DRF Parameters +### Deprecated DRF Parameters The following parameters have been removed: @@ -525,13 +525,13 @@ The following parameters have been removed: -###New DRF Parameters +### New DRF Parameters The following parameter has been added: - `build_tree_one_node`: Run on a single node to use fewer CPUs. -###DRF Algorithm Comparison +### DRF Algorithm Comparison H2O Classic | H2O 3.0 ------------- | ------------- @@ -565,7 +565,7 @@ H2O Classic | H2O 3.0 `type = "fast")` | -###Output +### Output The following table provides the component name in H2O, the corresponding component name in H2O 3.0 (if supported), and the model type (binomial, multinomial, or all). Many components are now included in `h2o.performance`; for more information, refer to [(`h2o.performance`)](#h2operf).
diff --git a/h2o-docs/src/product/upgrade/JavaChanges.md b/h2o-docs/src/product/upgrade/JavaChanges.md index 356e70cd31c..cda8256c2f0 100644 --- a/h2o-docs/src/product/upgrade/JavaChanges.md +++ b/h2o-docs/src/product/upgrade/JavaChanges.md @@ -1,23 +1,23 @@ -#Java Changes +# Java Changes This document describes the changes in the Java API from early versions of H2O 3.0 to the current version. -##Unify distribution parameter cross algorithms +## Unify distribution parameter across algorithms The representation of the distribution family has been unified across the H2O code base. The `GBMParameters#_distribution` type has been changed from `GBMModel.GBMParameters.Family` to `hex.genmodel.utils.DistributionFamily`. The enum `GBMModel.GBMParameters.Family` has been deprecated. Use the enum `hex.genmodel.utils.DistributionFamily` instead. -##`ValueString#equals` semantics changed +## `ValueString#equals` semantics changed This change affects all comparisons using the form `new ValueString("test") == "test"`. In previous versions of H2O, the method `water.parser.BufferedString#equals` was used for comparing Java strings. This method has been deprecated; instead, use the `toString` method to convert the ValueString to a Java string, then compare the results using the `String#equals` method. -##Start of H2O client app changed +## Start of H2O client app changed The method `water.H2OClientApp#start` has been deprecated. Use the `main` method instead. -##Use of type parameter for `water.Key` unified +## Use of type parameter for `water.Key` unified All methods accepting or returning `water.Key` have been changed to always accept or return a generic form of `Key`. For example, a signature of the method `Key#make` has been changed to `public static <P extends Keyed> Key<P> make()`. Clients should always use `Key` with a specific target type (e.g., `Key<Frame>`, `Key<Model>`). diff --git a/h2o-docs/src/product/upgrade/Migration.md b/h2o-docs/src/product/upgrade/Migration.md index 554a7e00346..05308e063e0 100644 --- a/h2o-docs/src/product/upgrade/Migration.md +++ b/h2o-docs/src/product/upgrade/Migration.md @@ -1,4 +1,4 @@ -#Migrating to H2O 3.0 +# Migrating to H2O 3.0 We're excited about the upcoming release of the latest and greatest version of H2O, and we hope you are too! H2O 3.0 has lots of improvements, including: @@ -19,29 +19,30 @@ Overall, H2O 3.0 is more stable, elegant, and simplified, with additional capabi --- -##Algorithm Changes +## Algorithm Changes Most of the algorithms available in previous versions of H2O have been improved in terms of speed and accuracy. Currently available model types include: -###Supervised +### Supervised - **Generalized Linear Model (GLM)**: Binomial classification, multinomial classification, regression (including logistic regression) - **Distributed Random Forest (DRF)**: Binomial classification, multinomial classification, regression - **Gradient Boosting Machine (GBM)**: Binomial classification, multinomial classification, regression - **Deep Learning (DL)**: Binomial classification, multinomial classification, regression +- Naïve Bayes +- Stacked Ensembles +- XGBoost -###Unsupervised +### Unsupervised - K-means - Principal Component Analysis -- Autoencoder +- Autoencoder +- Generalized Low Rank Models -There are a few algorithms that are still being refined to provide these same benefits and will be available in a future version of H2O. - -Currently, the following algorithms and associated capabilities are still in development: - -- Naïve Bayes +### Miscellaneous +- **Word2vec** Check back for updates, as these algorithms will be re-introduced in an improved form in a future version of H2O.
@@ -49,13 +50,13 @@ Check back for updates, as these algorithms will be re-introduced in an improved --- -##Parsing Changes +## Parsing Changes In H2O Classic, the parser reads all the data and tries to guess the column type. In H2O 3.0, the parser reads a subset and makes a type guess for each column. In Flow, you can view the preliminary parse results in the **Edit Column Names and Types** area. To change the column type, select an option from the drop-down menu to the right of the column. H2O 3.0 can also automatically identify mixed-type columns; in H2O Classic, if a column mixed integers or real numbers with strings, the output was blank. --- -##Web UI Changes +## Web UI Changes Our web UI has been completely overhauled with a much more intuitive interface that is similar to IPython Notebook. Each point-and-click action is translated immediately into an individual workflow script that can be saved for later interactive and offline use. As a result, you can now revise and rerun your workflows easily, and can even add comments and rich media. @@ -63,7 +64,7 @@ For more information, refer to our [Getting Started with Flow](https://github.co --- -##API Users +## API Users H2O's new Python API allows Pythonistas to use H2O in their favorite environment. Using the Python command line or an integrated development environment like IPython Notebook, H2O users can control clusters and manage massive datasets quickly. @@ -71,7 +72,7 @@ H2O's REST API is the basis for the web UI (Flow), as well as the R and Python A --- -##Java Users +## Java Users Generated Java REST classes ease REST API use by external programs running in a Java Virtual Machine (JVM).
@@ -79,7 +80,7 @@ As in previous versions of H2O, users can export trained models as Java objects --- -##R Users +## R Users If you use H2O primarily in R, be aware that as a result of the improvements to the R package for H2O, scripts created using previous versions (Nunes 2.8.6.2 or prior) will require minor revisions to work with H2O 3.0. @@ -93,7 +94,7 @@ There is also an [R Porting Guide](#PortingGuide) that provides a side-by-side c -#Porting R Scripts +# Porting R Scripts This document outlines how to port R scripts written in previous versions of H2O (Nunes 2.8.6.2 or prior, also known as "H2O Classic") for compatibility with the new H2O 3.0 API. When upgrading from H2O to H2O 3.0, most functions are the same. However, there are some differences that will need to be resolved when porting any scripts that were originally created using H2O to H2O 3.0. @@ -105,9 +106,9 @@ For additional assistance within R, enter a question mark before the command (fo There is also a "shim" available that will review R scripts created with previous versions of H2O, identify deprecated or renamed parameters, and suggest replacements. For more information, refer to the repo [here](https://github.com/h2oai/h2o-dev/blob/d9693a97da939a2b77c24507c8b40a5992192489/h2o-r/h2o-package/R/shim.R). -##Changes from H2O 2.8 to H2O 3.0 +## Changes from H2O 2.8 to H2O 3.0 -###`h2o.exec` +### `h2o.exec` The `h2o.exec` command is no longer supported. Any workflows using `h2o.exec` must be revised to remove this command. If the H2O 3.0 workflow contains any parameters or commands from H2O Classic, errors will result and the workflow will fail. The purpose of `h2o.exec` was to wrap expressions so that they could be evaluated in a single `\Exec2` call. For example, @@ -129,23 +130,23 @@ A String array is ["f00", "b4r"], *not* "[\"f00\", \"b4r\"]" Only string values are enclosed in double quotation marks (`"`).
-###`h2o.performance` +### `h2o.performance` To access any exclusively binomial output, use `h2o.performance`, optionally with the corresponding accessor. The accessor can only use the model metrics object created by `h2o.performance`. Each accessor is named for its corresponding field (for example, `h2o.AUC`, `h2o.gini`, `h2o.F1`). `h2o.performance` supports all current algorithms except for K-Means. If you specify a data frame as a second parameter, H2O will use the specified data frame for scoring. If you do not specify a second parameter, the training metrics for the model metrics object are used. -###`xval` and `validation` slots +### `xval` and `validation` slots The `xval` slot has been removed, as `nfolds` is not currently supported. The `validation` slot has been merged with the `model` slot. -###Principal Components Regression (PCR) +### Principal Components Regression (PCR) Principal Components Regression (PCR) has also been deprecated. To obtain PCR values, create a Principal Components Analysis (PCA) model, then create a GLM model from the scored data from the PCA model. -###Saving and Loading Models +### Saving and Loading Models Saving and loading a model from R is supported in version 3.0.0.18 and later. H2O 3.0 uses the same binary serialization method as previous versions of H2O, but saves the model and its dependencies into a directory, with each object as a separate file. The `save_CV` option available in previous versions of H2O has been deprecated, as `h2o.saveAll` and `h2o.loadAll` are not currently supported. The following commands are now supported: @@ -165,11 +166,11 @@ Saving and loading a model from R is supported in version 3.0.0.18 and later. H2 -##GBM +## GBM N-fold cross-validation and grid search are currently supported in H2O 3.0.
-###Renamed GBM Parameters +### Renamed GBM Parameters The following parameters have been renamed, but retain the same functions: @@ -187,7 +188,7 @@ H2O Classic Parameter Name | H2O 3.0 Parameter Name `max.after.balance.size` | `max_after_balance_size` -###Deprecated GBM Parameters +### Deprecated GBM Parameters The following parameters have been removed: @@ -196,7 +197,7 @@ The following parameters have been removed: - `holdout.fraction`: The fraction of the training data to hold out for validation is no longer supported. - `grid.parallelism`: Specifying the number of parallel threads to run during a grid search is no longer supported. -###New GBM Parameters +### New GBM Parameters The following parameters have been added: @@ -204,7 +205,7 @@ The following parameters have been added: - `score_each_iteration`: Display error rate information after each tree in the requested set is built. - `build_tree_one_node`: Run on a single node to use fewer CPUs. -###GBM Algorithm Comparison +### GBM Algorithm Comparison H2O Classic | H2O 3.0 ------------- | ------------- @@ -247,7 +248,7 @@ H2O Classic | H2O 3.0 `grid.parallelism = 1)` | -###Output +### Output The following table provides the component name in H2O, the corresponding component name in H2O 3.0 (if supported), and the model type (binomial, multinomial, or all). Many components are now included in `h2o.performance`; for more information, refer to [(`h2o.performance`)](#h2operf). 
@@ -278,9 +279,9 @@ H2O Classic | H2O 3.0 | Model Type --- -##GLM +## GLM -###Renamed GLM Parameters +### Renamed GLM Parameters The following parameters have been renamed, but retain the same functions: @@ -293,7 +294,7 @@ H2O Classic Parameter Name | H2O 3.0 Parameter Name `iter.max` | `max_iterations` `epsilon` | `beta_epsilon` -###Deprecated GLM Parameters +### Deprecated GLM Parameters The following parameters have been removed: @@ -305,14 +306,14 @@ The following parameters have been removed: - `disable_line_search`: This parameter has been deprecated, as it was mainly used for testing purposes. - `max_predictors`: Stops training the algorithm if the number of predictors exceeds the specified value. (may be re-added) -###New GLM Parameters +### New GLM Parameters The following parameters have been added: - `validation_frame`: Specify the validation dataset. - `solver`: Select IRLSM or LBFGS. -###GLM Algorithm Comparison +### GLM Algorithm Comparison H2O Classic | H2O 3.0 @@ -356,7 +357,7 @@ H2O Classic | H2O 3.0 `max_predictors = -1)` | `max_active_predictors = -1)` -###Output +### Output The following table provides the component name in H2O, the corresponding component name in H2O 3.0 (if supported), and the model type (binomial, multinomial, or all). Many components are now included in `h2o.performance`; for more information, refer to [(`h2o.performance`)](#h2operf). @@ -382,9 +383,9 @@ H2O Classic | H2O 3.0 | Model Type `@model$confusion` |   | `binomial` -##K-Means +## K-Means -###Renamed K-Means Parameters +### Renamed K-Means Parameters The following parameters have been renamed, but retain the same functions: @@ -399,14 +400,14 @@ H2O Classic Parameter Name | H2O 3.0 Parameter Name **Note** In H2O, the `normalize` parameter was disabled by default. The `standardize` parameter is enabled by default in H2O 3.0 to provide more accurate results for datasets containing columns with large values. 
-###New K-Means Parameters +### New K-Means Parameters The following parameters have been added: - `user` has been added as an additional option for the `init` parameter. Using this parameter forces the K-Means algorithm to start at the user-specified points. - `user_points`: Specify starting points for the K-Means algorithm. -###K-Means Algorithm Comparison +### K-Means Algorithm Comparison H2O Classic | H2O 3.0 ------------- | ------------- @@ -424,7 +425,7 @@ H2O Classic | H2O 3.0   | `fold_assignment = c("AUTO", "Random", "Modulo"),`   | `keep_cross_validation_predictions = FALSE)` -###Output +### Output The following table provides the component name in H2O and the corresponding component name in H2O 3.0 (if supported). @@ -442,11 +443,11 @@ H2O Classic | H2O 3.0 --- -##Deep Learning +## Deep Learning **Note**: If the results in the confusion matrix are incorrect, verify that `score_training_samples` is equal to 0. By default, only the first 10,000 rows are included. -###Renamed Deep Learning Parameters +### Renamed Deep Learning Parameters The following parameters have been renamed, but retain the same functions: @@ -460,7 +461,7 @@ H2O Classic Parameter Name | H2O 3.0 Parameter Name `dlmodel@model$valid_class_error` | `@model$validation_metrics@$MSE` -###Deprecated DL Parameters +### Deprecated DL Parameters The following parameters have been removed: @@ -468,7 +469,7 @@ The following parameters have been removed: - `holdout_fraction`: Fraction of the training data to hold out for validation. - `dlmodel@model$best_cutoff`: This output parameter has been removed. 
-###New DL Parameters +### New DL Parameters The following parameters have been added: @@ -479,7 +480,7 @@ The following options for the `loss` parameter have been added: - `absolute`: Provides strong penalties for mispredictions - `huber`: Can improve results for regression -###DL Algorithm Comparison +### DL Algorithm Comparison H2O Classic | H2O 3.0 ------------- | ------------- @@ -559,7 +560,7 @@ H2O Classic | H2O 3.0   | `fold_assignment = c("AUTO", "Random", "Modulo"),`   | `keep_cross_validation_predictions = FALSE)` -###Output +### Output The following table provides the component name in H2O, the corresponding component name in H2O 3.0 (if supported), and the model type (binomial, multinomial, or all). Many components are now included in `h2o.performance`; for more information, refer to [(`h2o.performance`)](#h2operf). @@ -581,9 +582,9 @@ H2O Classic | H2O 3.0 | Model Type --- -##Distributed Random Forest +## Distributed Random Forest -###Changes to DRF in H2O 3.0 +### Changes to DRF in H2O 3.0 Distributed Random Forest (DRF) was represented as `h2o.randomForest(type="BigData", ...)` in H2O Classic. In H2O Classic, SpeeDRF (`type="fast"`) was not as accurate, especially for complex data with categoricals, and did not address regression problems. DRF (`type="BigData"`) was at least as accurate as SpeeDRF (`type="fast"`) and was the only algorithm that scaled to big data (data too large to fit on a single node). In H2O 3.0, our plan is to improve the performance of DRF so that the data fits on a single node (optimally, for all cases), which will make SpeeDRF obsolete. Ultimately, the goal is to provide a single algorithm that provides the "best of both worlds" for all datasets and use cases. @@ -592,7 +593,7 @@ Please note that H2O does not currently support the ability to specify the numbe **Note**: H2O 3.0 only supports DRF. SpeeDRF is no longer supported. The functionality of DRF in H2O 3.0 is similar to DRF functionality in H2O.
-###Renamed DRF Parameters +### Renamed DRF Parameters The following parameters have been renamed, but retain the same functions: @@ -610,7 +611,7 @@ H2O Classic Parameter Name | H2O 3.0 Parameter Name `nodesize` | `min_rows` -###Deprecated DRF Parameters +### Deprecated DRF Parameters The following parameters have been removed: @@ -623,13 +624,13 @@ The following parameters have been removed: - `stat.type`: This parameter was used for SpeeDRF, which is no longer supported. - `type`: This parameter was used for SpeeDRF, which is no longer supported. -###New DRF Parameters +### New DRF Parameters The following parameter has been added: - `build_tree_one_node`: Run on a single node to use fewer CPUs. -###DRF Algorithm Comparison +### DRF Algorithm Comparison H2O Classic | H2O 3.0 ------------- | ------------- @@ -673,7 +674,7 @@ H2O Classic | H2O 3.0 `type = "fast")` | -###Output +### Output The following table provides the component name in H2O, the corresponding component name in H2O 3.0 (if supported), and the model type (binomial, multinomial, or all). Many components are now included in `h2o.performance`; for more information, refer to [(`h2o.performance`)](#h2operf). @@ -700,7 +701,7 @@ H2O Classic | H2O 3.0 | Model Type `@model$max_per_class_err` | currently replaced by `@model$training_metrics@metrics$thresholds_and_metric_scores$min_per_class_correct` | `binomial` -##Github Users +## Github Users All users who pull directly from the H2O classic repo on Github should be aware that this repo will be renamed. To retain access to the original H2O (2.8.6.2 and prior) repository: @@ -708,29 +709,29 @@ All users who pull directly from the H2O classic repo on Github should be aware This is the easiest way to change your local repo and is recommended for most users. -0. Enter `git remote -v` to view a list of your repositories. -0. 
Copy the address your H2O classic repo (refer to the text in brackets below - your address will vary depending on your connection method): +1. Enter `git remote -v` to view a list of your repositories. +2. Copy the address of your H2O classic repo (refer to the text in brackets below - your address will vary depending on your connection method): ``` H2O_User-MBP:h2o H2O_User$ git remote -v origin https://{H2O_User@github.com}/h2oai/h2o.git (fetch) origin https://{H2O_User@github.com}/h2oai/h2o.git (push) ``` -0. Enter `git remote set-url origin {H2O_User@github.com}:h2oai/h2o-2.git`, where `{H2O_User@github.com}` represents the address copied in the previous step. +3. Enter `git remote set-url origin {H2O_User@github.com}:h2oai/h2o-2.git`, where `{H2O_User@github.com}` represents the address copied in the previous step. **The more complicated way** This method involves editing the Github config file and should only be attempted by users who are confident enough with their knowledge of Github to do so. -0. Enter `vim .git/config`. -0. Look for the `[remote "origin"]` section: +1. Enter `vim .git/config`. +2. Look for the `[remote "origin"]` section: ``` [remote "origin"] url = https://H2O_User@github.com/h2oai/h2o.git fetch = +refs/heads/*:refs/remotes/origin/* ``` -0. In the `url =` line, change `h2o.git` to `h2o-2.git`. -0. Save the changes. +3. In the `url =` line, change `h2o.git` to `h2o-2.git`. +4. Save the changes. The latest version of H2O is stored in the `h2o-3` repository. All previous links to this repo will still work, but if you would like to manually update your Github configuration, follow the instructions above, replacing `h2o-2` with `h2o-3`. diff --git a/h2o-docs/src/product/upgrade/PressRelease.md b/h2o-docs/src/product/upgrade/PressRelease.md index 61509ed3ac3..adf5a191b51 100644 --- a/h2o-docs/src/product/upgrade/PressRelease.md +++ b/h2o-docs/src/product/upgrade/PressRelease.md @@ -1,4 +1,4 @@ -#H2O 3.0 is here! +# H2O 3.0 is here!
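The "easy way" above can be sanity-checked locally with a throwaway repository. This sketch uses a temporary directory and the example `H2O_User` URL purely as placeholders; no network access is needed because nothing is ever fetched:

```shell
set -e
demo=$(mktemp -d)            # throwaway working copy
git -C "$demo" init -q
git -C "$demo" remote add origin https://H2O_User@github.com/h2oai/h2o.git

# Step 1: list the configured remotes.
git -C "$demo" remote -v

# Steps 2-3: rewrite the URL, swapping h2o.git for h2o-2.git.
old=$(git -C "$demo" remote get-url origin)
git -C "$demo" remote set-url origin "${old%h2o.git}h2o-2.git"
git -C "$demo" remote -v
```

`git remote set-url` only edits `.git/config`, which is why the "more complicated way" of editing that file by hand produces the same result.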
The new version of H2O offers a single integrated and tested platform for enterprise and open-source use, enhanced usability through a new web user interface (UI) with embeddable workflows, elegant APIs, and direct integration for Python and Sparkling Water. diff --git a/h2o-docs/src/product/upgrade/PythonParity.md b/h2o-docs/src/product/upgrade/PythonParity.md index 4305aafb8d3..5c859bae126 100644 --- a/h2o-docs/src/product/upgrade/PythonParity.md +++ b/h2o-docs/src/product/upgrade/PythonParity.md @@ -296,7 +296,7 @@ This group includes: -####Summary Group +#### Summary Group This group includes: @@ -318,7 +318,7 @@ This group includes: |`all`| `all`| |`any`|`any`| -####Non-Group Generic +#### Non-Group Generic This group includes: diff --git a/h2o-docs/src/product/upgrade/RChanges.md b/h2o-docs/src/product/upgrade/RChanges.md index e6577dcb274..50e5927e1d5 100644 --- a/h2o-docs/src/product/upgrade/RChanges.md +++ b/h2o-docs/src/product/upgrade/RChanges.md @@ -1,15 +1,15 @@ -#R Interface Improvements for H2O +# R Interface Improvements for H2O Recent improvements in the R wrapper for H2O may cause previously written R scripts to be inoperable. This document describes these changes and provides guidelines on updating scripts for compatibility. -##H2O Connection Object +## H2O Connection Object The H2O connection object (`conn`) has been removed from nearly all calls. The `conn` object is still used in the `h2o.clusterIsUp` command. Any `conn` references for commands other than `h2o.clusterIsUp` must be removed from scripts to ensure compatibility. -##Changes to `apply` +## Changes to `apply` The data shape returned by `apply` is now identical to the default behavior in R. Any column-wide changes produce column-wide results. @@ -17,7 +17,7 @@ For example, in previous versions, if `apply` on `MARGIN` was equal to `2`, then To revert to the previous behavior, use the R transpose function `t`.
-##Temp Management +## Temp Management For users who regularly remove the temporary data frames and keys manually, the temp management rules have been improved in the following ways: @@ -32,10 +32,10 @@ For users who regularly remove the temporary data frames and keys manually, the - If your cluster is running low on memory, run an R GC cycle to delete temporary data frames and keys -##S4 to S3 +## S4 to S3 The internal H2O object, which was previously an S4 object, is now an S3 object. You must use S3 operations to access objects (instead of S4). The risk of overloading depends on whether the package overloads the existing package type. -##`frame_id` to `id` +## `frame_id` to `id` The `frame_id` property has been renamed to `id`. This property is used in the `h2o.getFrame` command. \ No newline at end of file diff --git a/h2o-docs/src/product/upgrade/Rdoc.md b/h2o-docs/src/product/upgrade/Rdoc.md index 475d151377b..a33bf989a77 100644 --- a/h2o-docs/src/product/upgrade/Rdoc.md +++ b/h2o-docs/src/product/upgrade/Rdoc.md @@ -1,16 +1,18 @@ -#Intro to using H2O-Dev from R with data munging (for PUBDEV-562) +# Intro to using H2O-Dev from R with data munging (for PUBDEV-562) + +>**Note**: This topic is no longer being maintained. Refer to the [R Booklet](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/RBooklet.pdf) for the most up-to-date documentation. We have the reference doc for the H2O R binding, but we regularly get questions from new users asking about which parts of R are supported, in particular regarding data munging. A 15-20 page intro doc would be really useful. Perhaps this should be a new booklet in the small yellow book series. It should give an overview of: -0. how the big data is kept in the cluster and manipulated from R via references, +1. how the big data is kept in the cluster and manipulated from R via references, -1. how to move data back and forth between data in R , +2. how to move data back and forth between H2O and R, -2.
what operations are implemented in the H2O back end, +3. what operations are implemented in the H2O back end, -3. example scripts which include simple data munging (frame manipulation via R expressions and ddply), perhaps based on the CityBike example (sans weather join) and Alex's examples. +4. example scripts which include simple data munging (frame manipulation via R expressions and ddply), perhaps based on the CityBike example (sans weather join) and Alex's examples. Per Ray, this doc should also include: @@ -21,7 +23,7 @@ explanation of how it works standard data prep -#What is H2O? +# What is H2O? H2O is fast, scalable, open-source machine learning and deep learning for Smarter Applications. With H2O, enterprises like PayPal, Nielsen, Cisco, and others can use all of their data without sampling and get accurate predictions faster. Advanced algorithms, like Deep Learning, Boosting, and Bagging Ensembles are readily available for application designers to build smarter applications through elegant APIs. Some of our earliest customers have built powerful domain-specific predictive engines for Recommendations, Customer Churn, Propensity to Buy, Dynamic Pricing, and Fraud Detection for the Insurance, Healthcare, Telecommunications, AdTech, Retail, and Payment Systems. @@ -31,37 +33,37 @@ H2O implements almost all common machine learning algorithms, such as generalize H2O is nurturing a grassroots movement of physicists, mathematicians, computer and data scientists to herald the new wave of discovery with data science. Academic researchers and industrial data scientists collaborate closely with our team to make this possible. Stanford University giants Stephen Boyd, Trevor Hastie, and Rob Tibshirani advise the H2O team to build scalable machine learning algorithms.
With hundreds of meetups over the past two years, H2O has become a growing word-of-mouth phenomenon amongst the data community, now implemented by 12,000+ users and deployed in 2000+ corporations using R, Python, Hadoop and Spark. -#Intro +# Intro how the big data is kept in the cluster and manipulated from R via references what operations are implemented in the H2O back end -#Installation +# Installation -###Installing R or R Studio +### Installing R or R Studio To download R: -0. Go to [http://cran.r-project.org/mirrors.html](http://cran.r-project.org/mirrors.html). -0. Select your closest local mirror. -0. Select your operating system (Linux, OS X, or Windows). -0. Depending on your OS, download the appropriate file, along with any required packages. -0. When the download is complete, unzip the file and install. +1. Go to [http://cran.r-project.org/mirrors.html](http://cran.r-project.org/mirrors.html). +2. Select your closest local mirror. +3. Select your operating system (Linux, OS X, or Windows). +4. Depending on your OS, download the appropriate file, along with any required packages. +5. When the download is complete, unzip the file and install. To download R Studio: -0. Go to [http://www.rstudio.com/products/rstudio/](http://www.rstudio.com/products/rstudio/). -0. Select your deployment type (desktop or server). -0. Download the file. -0. When the download is complete, unzip the file and install. +1. Go to [http://www.rstudio.com/products/rstudio/](http://www.rstudio.com/products/rstudio/). +2. Select your deployment type (desktop or server). +3. Download the file. +4. When the download is complete, unzip the file and install. -#H2O Initialization +# H2O Initialization -0. Go to [h2o.ai/downloads](http://h2o.ai/downloads). -0. Under **Download H2O**, select a build. The "bleeding edge" build contains the latest changes, while the "latest stable release" may be more reliable. -0. Click the **Install in R** tab above the **Download H2O** button. -0. 
Copy and paste the commands into R or R Studio, one line at a time. +1. Go to [h2o.ai/downloads](http://h2o.ai/downloads). +2. Under **Download H2O**, select a build. The "bleeding edge" build contains the latest changes, while the "latest stable release" may be more reliable. +3. Click the **Install in R** tab above the **Download H2O** button. +4. Copy and paste the commands into R or R Studio, one line at a time. The lines are reproduced below; however, you should not copy and paste them, as the required version number has been replaced with asterisks (*). Refer to the [Downloads page](http://h2o.ai/downloads) for the latest version number. @@ -89,7 +91,7 @@ The lines are reproduced below; however, you should not copy and paste them, as You can also enter `install.packages("h2o")` in R to load the latest H2O R package from CRAN. -###Making a Build from Source Code +### Making a Build from Source Code The R package is built as part of the standard build process. In the top-level `h2o-3` directory, use `./gradlew build`. To build the R component by itself: @@ -100,16 +102,16 @@ The build output is located in a CRAN-like layout in the R directory. -####Installation from the command line +#### Installation from the command line -0. Navigate to the top-level `h2o-3` directory: `cd ~/h2o-3`. -0. Install the H2O package for R: `R CMD INSTALL h2o-r/R/src/contrib/h2o_****.tar.gz` +1. Navigate to the top-level `h2o-3` directory: `cd ~/h2o-3`. +2. Install the H2O package for R: `R CMD INSTALL h2o-r/R/src/contrib/h2o_****.tar.gz` **Note**: Do not copy and paste the command above. You must replace the asterisks (*) with the current H2O .tar version number. Look in the `h2o-3/h2o-r/R/src/contrib/` directory for the version number. ### Installation from within R -0. Detach any currently loaded H2O package for R. +1. Detach any currently loaded H2O package for R.
`if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }` ``` @@ -117,7 +119,7 @@ The build output is located in a CRAN-like layout in the R directory. (as ‘lib’ is unspecified) ``` -0. Remove any previously installed H2O package for R. +2. Remove any previously installed H2O package for R. `if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }` ``` @@ -126,7 +128,7 @@ The build output is located in a CRAN-like layout in the R directory. (as ‘lib’ is unspecified) ``` -0. Install the dependencies for H2O. +3. Install the dependencies for H2O. **Note**: This list may change as new capabilities are added to H2O. The commands are reproduced below, but we strongly recommend visiting the H2O download page at [h2o.ai/download](http://h2o.ai/download) for the most up-to-date list of dependencies. @@ -141,7 +143,7 @@ The build output is located in a CRAN-like layout in the R directory. if (! ("utils" %in% rownames(installed.packages()))) { install.packages("utils") } ``` -0. Install the H2O R package from your build directory. +4. Install the H2O R package from your build directory. `install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/master/****/R")))` **Note**: Do not copy and paste the command above. You must replace the asterisks (*) with the current H2O build number. Refer to the H2O download page at [h2o.ai/download](http://h2o.ai/download) for the latest build number. @@ -213,14 +215,14 @@ Note: As started, H2O is limited to the CRAN default of 2 CPUs. > localH2O = h2o.init(nthreads = -1) ``` -##Munging operations in R: +## Munging operations in R: -###Overview: +### Overview: Operating on an `H2OFrame` object triggers the rollup of the expression to be executed, but the expression itself is not evaluated. Instead, an AST is built from the R expression using R's built-in parser, which handles operator precedence. In the case of assignment, the AST is stashed into the variable in the assignment.
The AST is bound to an R variable as a promise to evaluate the expression on demand. When evaluation is forced, the AST is walked, converted to JSON, and shipped over to H2O. The result returned by H2O is a key pointing to the newly-created frame. Depending on the methods used, the results may not be an H2OFrame return type. Any extra preprocessing of data returned by H2O is discussed in each instance, as it varies from method to method. -###What's implemented? +### What's implemented? Many of R's generic S3 methods can be combined with H2OFrame objects so that the result is coerced to an object of the appropriate type (typically an H2OFrame object). To view a list of R's generic methods, use `getGenerics()`. A call to `showMethods(classes="H2OFrame")` displays a list of permissible operations with H2OFrame objects. S3 methods are divided into four groups: - Math @@ -236,9 +238,9 @@ With the exception of Complex, H2OFrame methods fall into these categories as we - Summary -###List: +### List: -####Ops Group +#### Ops Group This group includes: @@ -258,7 +260,7 @@ This group includes: |`&`| `∣`| | | -####Math Group +#### Math Group This group includes: @@ -295,7 +297,7 @@ This group includes: -####Summary Group +#### Summary Group This group includes: @@ -313,7 +315,7 @@ This group includes: |`sum`|`all`| |`any`| -####Non-Group Generic +#### Non-Group Generic This group includes: @@ -356,25 +358,25 @@ This group includes: -#Data Prep in R +# Data Prep in R standard data prep -#Data Manipulation in R +# Data Manipulation in R how to move data back and forth between data in R slicing creating new columns -#Examples/Demos +# Examples/Demos -#Support +# Support -Users of the H2O package may submit general inquiries and bug reports to the H2O.ai support address, [support@h2oai.com](mailto:support@h2oai.com). Alternatively, specific bugs or issues may be filed to the H2O JIRA, [https://0xdata.atlassian.net](https://0xdata.atlassian.net). 
+Users of the H2O package may submit general inquiries and bug reports using the "h2o" tag on [Stack Overflow](https://stackoverflow.com/questions/tagged/h2o). Alternatively, specific bugs or issues may be filed to the H2O JIRA, [https://0xdata.atlassian.net](https://0xdata.atlassian.net). -#References +# References -#Appendix +# Appendix (commands) diff --git a/h2o-docs/src/product/upgrade/Upgrade.md b/h2o-docs/src/product/upgrade/Upgrade.md index c8d6fec7acd..18a24230522 100644 --- a/h2o-docs/src/product/upgrade/Upgrade.md +++ b/h2o-docs/src/product/upgrade/Upgrade.md @@ -1,6 +1,6 @@ -#Upgrading to H2O 3.0 +# Upgrading to H2O 3.0 -##Why Upgrade? +## Why Upgrade? H2O 3.0 represents our latest iteration of H2O. It includes many improvements, such as a simplified architecture, faster and more accurate algorithms, and an interactive web UI. @@ -8,37 +8,40 @@ As of May 15th, 2015, this version will supersede the previous version of H2O. S For a comparison of H2O and H2O 3.0, please refer to this document. -###Python Support +### Python Support Python is only supported on the latest version of H2O. For more information, refer to the Python installation instructions. -###Sparkling Water Support +### Sparkling Water Support Sparkling Water is only supported with H2O 3.0. For more information, refer to the Sparkling Water repo. -##Supported Algorithms +## Supported Algorithms H2O 3.0 will soon provide feature parity with previous versions of H2O. 
Currently, the following algorithms are supported: -###Supervised +### Supervised - **Generalized Linear Model (GLM)**: Binomial classification, multinomial classification, regression (including logistic regression) - **Distributed Random Forest (DRF)**: Binomial classification, multinomial classification, regression - **Gradient Boosting Machine (GBM)**: Binomial classification, multinomial classification, regression - **Deep Learning (DL)**: Binomial classification, multinomial classification, regression +- Naive Bayes +- Stacked Ensembles +- XGBoost -###Unsupervised +### Unsupervised - K-means - Principal Component Analysis -- Autoencoder - +- Autoencoder +- Generalized Low Rank Models -###Still In Testing +### Miscellaneous -- Naive Bayes +- **Word2vec** -##How to Update R Scripts +## How to Update R Scripts Due to the numerous enhancements to the H2O package for R to make it more consistent and simplified, some parameters have been renamed or deprecated. From c6b602661aac072416d5ea004c489b24f3c0ebf8 Mon Sep 17 00:00:00 2001 From: angela0xdata Date: Fri, 23 Jun 2017 07:42:35 -0700 Subject: [PATCH 2/2] minor update Fixed heading underlining. --- h2o-docs/src/product/flow.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/h2o-docs/src/product/flow.rst b/h2o-docs/src/product/flow.rst index 8b0a42b58fa..f00bb4a8ca5 100644 --- a/h2o-docs/src/product/flow.rst +++ b/h2o-docs/src/product/flow.rst @@ -1421,7 +1421,7 @@ and selecting **List All Predictions**. Interpreting the Gains/Lift Chart -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The Gains/Lift chart evaluates the prediction ability of a binary classification model. The chart is computed using the prediction probability and the true response (class) labels. The accuracy of the classification model for a random sample is evaluated according to the results when the model is and is not used.