Snap type:Transform

This Snap executes a Spark Python script.

Prerequisites: This Snap requires a Hadooplex with Spark configured.
Support and limitations:
  • Ultra pipelines: Supported for use in Ultra Pipelines.
  • Spark (Deprecated): Spark must be configured. Not supported for use in a Spark pipeline.
  • Snaplex: Requires a Hadooplex.

Accounts are not used with this Snap.


Input: This Snap has at most one document input view.
Output: This Snap has at most one document output view.
Error: This Snap has at most one document error view and produces zero or more documents in the view.



Label

Required. The name of the Snap. You can modify this to be more specific, especially if you have more than one of the same Snap in your pipeline.

Directory

Required. The base path for the Input path and Output path properties. The suggestion shows only directories.

The first version of the Spark Script Snap supports reading from and writing to HDFS only, so the default value is the HDFS scheme.

Input path

Required. Relative path to the input, where Directory is the root.
The suggestion shows all files and directories under the Directory property.

Default value: [None]

Output path

Required. Relative path to the output, where Directory is the root.
The suggestion shows all files and directories under the Directory property.

Default value: [None]

Spark mode

Required. Specify the type of cluster manager used to allocate resources across applications. The available options are:

  • Standalone: a simple cluster manager included with Spark that makes it easy to set up a cluster. If you use Standalone mode, enter the master URL in the Spark master property and set it with setMaster inside your Spark script, or leave it blank.
  • Hadoop YARN: the resource manager in Hadoop. If you use YARN mode, you do not need to set Spark master.

YARN mode cannot use environment variables. If you try to read environment variables inside the Python script, you get a null pointer error. Comment out the following lines in the template script:

appName = os.environ.get("app")
input = os.environ.get("input")
output = os.environ.get("output")
master = os.environ.get("master")
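
The failure mode is easy to see outside Spark in plain Python: os.environ.get returns None for an unset variable, and any later use of that None as a name or path fails. A minimal sketch (the variable name "app" mirrors the template; the fallback value is illustrative):

```python
import os

# Mimic the template's lookup. In YARN mode the variable is unset,
# so os.environ.get(...) returns None, which later causes null errors.
os.environ.pop("app", None)          # ensure "app" is unset for the demo
app_name = os.environ.get("app")
assert app_name is None

# Supplying a default keeps the script safe even when the variable is unset.
app_name = os.environ.get("app", "SparkScriptSnap")
assert app_name == "SparkScriptSnap"
```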

If you want to set appName, use a string literal directly. For example, replace the appName variable as follows:

# For example, set the application name directly.
conf = SparkConf().setAppName("SparkScriptSnap")
# Pass the configuration to the Spark context.
sc = SparkContext(conf = conf)

Use the real path instead of the input and output variables. For example, replace input with a string such as "hdfs://localhost:10000/data/input":

# In this example a text file is read in, split up by line,
# and the usage of words is counted.
# The output contains each word in the file followed by the number of times it was used.
text_file = sc.textFile("hdfs://localhost:10000/data/input")
counts = text_file.flatMap(lambda line: line.split()) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
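
The flatMap/map/reduceByKey chain above computes a word count. The same result in plain Python (without Spark, shown only to clarify what the pipeline computes; the sample lines are illustrative):

```python
from collections import Counter

lines = ["to be or not to be", "that is the question"]

# flatMap with split(): flatten every line into a list of words.
words = [word for line in lines for word in line.split()]

# map + reduceByKey: count the occurrences of each word.
counts = Counter(words)

assert counts["to"] == 2
assert counts["be"] == 2
assert counts["question"] == 1
```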

For more information, see the Spark cluster mode overview in the Spark documentation.

Spark master

Sets the Spark master URL, in the form spark://HOST:PORT.

Default value: [None]

Spark home

The home path of the spark-submit executable.

Default value: [None]

App name

Sets the Spark application name.

Default value: SparkScriptSnap

Memory size

Sets the Spark executor memory size.

Default value: 2g

Edit Script

Required. This property enables you to edit a script within the Snap instead of through an external file.

From this page, you can export the script to a file in a project, import a script, or generate a template for the selected Scripting Language.
Default value: A skeleton for the chosen scripting language.  You can click the Generate Template button to regenerate the skeleton.


If the absolute paths for input and output are:

  • hdfs://localhost:10000/data/input
  • hdfs://localhost:10000/data/output

set the properties as follows:

  • Directory: hdfs://localhost:10000/data
  • Input path: input
  • Output path: output
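
How the properties combine can be sketched in plain Python (illustrative only; the actual Snap implementation may differ): the relative Input path and Output path are resolved against Directory as the root.

```python
import posixpath

# Snap property values from the example above.
directory = "hdfs://localhost:10000/data"
input_path = "input"
output_path = "output"

# The relative paths are joined onto Directory to form the absolute paths.
absolute_input = posixpath.join(directory, input_path)
absolute_output = posixpath.join(directory, output_path)

assert absolute_input == "hdfs://localhost:10000/data/input"
assert absolute_output == "hdfs://localhost:10000/data/output"
```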

Related Information

Snap History
