pipeasy_spark package

Submodules

pipeasy_spark.convenience module

pipeasy_spark.convenience.build_default_pipeline(dataframe, exclude_columns=())[source]

Build a simple (untrained) transformation pipeline for the given dataframe.

By default, numeric columns are processed with a StandardScaler and string columns with a StringIndexer followed by a OneHotEncoderEstimator.

Parameters:
  • dataframe (pyspark.sql.DataFrame) – only the schema of the dataframe is used, not the actual data.
  • exclude_columns (list of str) – names of the columns to which no transformation should be applied.
Returns:

pipeline – an untrained pyspark.ml.Pipeline instance
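
A minimal usage sketch (assuming an active SparkSession named spark; the input file name is hypothetical):

>>> from pipeasy_spark import build_default_pipeline
>>> # hypothetical input file; only its schema matters to build_default_pipeline
>>> df = spark.read.csv('titanic.csv', header=True, inferSchema=True)
>>> pipeline = build_default_pipeline(df, exclude_columns=['Survived'])
>>> trained_pipeline = pipeline.fit(df)
>>> transformed_df = trained_pipeline.transform(df)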

pipeasy_spark.convenience.build_pipeline_by_dtypes(dataframe, exclude_columns=(), string_transformers=(), numeric_transformers=())[source]

Build a simple (untrained) transformation pipeline for the given dataframe.

Parameters:
  • dataframe (pyspark.sql.DataFrame) – only the schema of the dataframe is used, not the actual data.
  • exclude_columns (list of str) – names of the columns to which no transformation should be applied.
  • string_transformers (list of transformer instances) – the successive transformations applied to string columns. Each element is an instance of a pyspark.ml.feature transformer class.
  • numeric_transformers (list of transformer instances) – the successive transformations applied to numeric columns. Each element is an instance of a pyspark.ml.feature transformer class.
Returns:

pipeline – an untrained pyspark.ml.Pipeline instance
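
For example, a sketch that scales numeric columns to [0, 1] instead of standardizing them (assuming the Titanic dataframe df from the build_pipeline example below):

>>> from pyspark.ml.feature import StringIndexer, VectorAssembler, MinMaxScaler
>>> from pipeasy_spark import build_pipeline_by_dtypes
>>> pipeline = build_pipeline_by_dtypes(
...     df,
...     exclude_columns=['Survived'],
...     string_transformers=[StringIndexer()],
...     numeric_transformers=[VectorAssembler(), MinMaxScaler()],
... )
>>> trained_pipeline = pipeline.fit(df)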

pipeasy_spark.core module

pipeasy_spark.core.build_pipeline(column_transformers)[source]

Create a dataframe transformation pipeline.

The created pipeline can be used to apply successive transformations to a Spark dataframe. The transformations are intended to be applied per column.

>>> from pyspark.ml.feature import (
...     StringIndexer, OneHotEncoderEstimator, VectorAssembler, StandardScaler
... )
>>> # 'titanic' is assumed to be a pyspark.sql.DataFrame of the Titanic dataset
>>> df = titanic.select('Survived', 'Sex', 'Age').dropna()
>>> df.show(2)
+--------+------+----+
|Survived|   Sex| Age|
+--------+------+----+
|       0|  male|22.0|
|       1|female|38.0|
+--------+------+----+
>>> pipeline = build_pipeline({
...     # 'Survived': this column is not modified; it could also be omitted from the dict.
...     'Survived': [],
...     'Sex': [StringIndexer(), OneHotEncoderEstimator(dropLast=False)],
...     # 'Age': a VectorAssembler must be applied before the StandardScaler,
...     # as the latter only accepts vectors as input.
...     'Age': [VectorAssembler(), StandardScaler()]
... })
>>> trained_pipeline = pipeline.fit(df)
>>> trained_pipeline.transform(df).show(2)
+--------+-------------+--------------------+
|Survived|          Sex|                 Age|
+--------+-------------+--------------------+
|       0|(2,[0],[1.0])|[1.5054181442954726]|
|       1|(2,[1],[1.0])| [2.600267703783089]|
+--------+-------------+--------------------+
Parameters:
  • column_transformers (dict(str -> list)) – each key is a column name (str); each value is a list of transformer instances (typically instances of pyspark.ml.feature transformers) to apply successively to that column.
Returns:

pipeline – a pyspark.ml.Pipeline instance

pipeasy_spark.transformers module

class pipeasy_spark.transformers.ColumnDropper(inputCols=None)[source]

Bases: pyspark.ml.base.Transformer, pyspark.ml.param.shared.HasInputCols

Transformer to drop several columns from a dataset.

transform(dataset)[source]

Transforms the input dataset.

Parameters:
  • dataset – input dataset, an instance of pyspark.sql.DataFrame
Returns:

transformed dataset
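
A short sketch (assuming the three-column Titanic dataframe df from the build_pipeline example above):

>>> from pipeasy_spark.transformers import ColumnDropper
>>> # df is assumed to have columns 'Survived', 'Sex', 'Age'
>>> dropper = ColumnDropper(inputCols=['Sex', 'Age'])
>>> dropper.transform(df).columns
['Survived']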

class pipeasy_spark.transformers.ColumnRenamer(inputCol=None, outputCol=None)[source]

Bases: pyspark.ml.base.Transformer, pyspark.ml.param.shared.HasInputCol, pyspark.ml.param.shared.HasOutputCol

Transformer to rename a column (inputCol) to a new name (outputCol).

transform(dataset)[source]

Transforms the input dataset.

Parameters:
  • dataset – input dataset, an instance of pyspark.sql.DataFrame
Returns:

transformed dataset
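
A short sketch (again assuming the three-column Titanic dataframe df; the column order shown assumes the rename keeps the column's position):

>>> from pipeasy_spark.transformers import ColumnRenamer
>>> renamer = ColumnRenamer(inputCol='Sex', outputCol='gender')
>>> renamer.transform(df).columns
['Survived', 'gender', 'Age']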

pipeasy_spark.transformers.set_transformer_in_out(transformer, inputCol, outputCol)[source]

Set input and output column(s) of a transformer instance.
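
A sketch of the intended use (this assumes the function returns the configured transformer instance, which the signature alone does not guarantee):

>>> from pyspark.ml.feature import StringIndexer
>>> # assumption: the configured instance is returned
>>> indexer = set_transformer_in_out(StringIndexer(), 'Sex', 'Sex_indexed')
>>> indexer.getInputCol(), indexer.getOutputCol()
('Sex', 'Sex_indexed')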

Module contents

Top-level package for pipeasy-spark.

The pipeasy-spark package provides a set of convenience functions that make it easier to map each column of a Spark dataframe (or subsets of columns) to user-specified transformations.
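
All three convenience functions documented below are importable from the package root:

>>> from pipeasy_spark import (
...     build_pipeline, build_pipeline_by_dtypes, build_default_pipeline
... )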

pipeasy_spark.build_pipeline(column_transformers)[source]

Create a dataframe transformation pipeline.

The created pipeline can be used to apply successive transformations to a Spark dataframe. The transformations are intended to be applied per column.

>>> from pyspark.ml.feature import (
...     StringIndexer, OneHotEncoderEstimator, VectorAssembler, StandardScaler
... )
>>> # 'titanic' is assumed to be a pyspark.sql.DataFrame of the Titanic dataset
>>> df = titanic.select('Survived', 'Sex', 'Age').dropna()
>>> df.show(2)
+--------+------+----+
|Survived|   Sex| Age|
+--------+------+----+
|       0|  male|22.0|
|       1|female|38.0|
+--------+------+----+
>>> pipeline = build_pipeline({
...     # 'Survived': this column is not modified; it could also be omitted from the dict.
...     'Survived': [],
...     'Sex': [StringIndexer(), OneHotEncoderEstimator(dropLast=False)],
...     # 'Age': a VectorAssembler must be applied before the StandardScaler,
...     # as the latter only accepts vectors as input.
...     'Age': [VectorAssembler(), StandardScaler()]
... })
>>> trained_pipeline = pipeline.fit(df)
>>> trained_pipeline.transform(df).show(2)
+--------+-------------+--------------------+
|Survived|          Sex|                 Age|
+--------+-------------+--------------------+
|       0|(2,[0],[1.0])|[1.5054181442954726]|
|       1|(2,[1],[1.0])| [2.600267703783089]|
+--------+-------------+--------------------+
Parameters:
  • column_transformers (dict(str -> list)) – each key is a column name (str); each value is a list of transformer instances (typically instances of pyspark.ml.feature transformers) to apply successively to that column.
Returns:

pipeline – a pyspark.ml.Pipeline instance
pipeasy_spark.build_pipeline_by_dtypes(dataframe, exclude_columns=(), string_transformers=(), numeric_transformers=())[source]

Build a simple (untrained) transformation pipeline for the given dataframe.

Parameters:
  • dataframe (pyspark.sql.DataFrame) – only the schema of the dataframe is used, not the actual data.
  • exclude_columns (list of str) – names of the columns to which no transformation should be applied.
  • string_transformers (list of transformer instances) – the successive transformations applied to string columns. Each element is an instance of a pyspark.ml.feature transformer class.
  • numeric_transformers (list of transformer instances) – the successive transformations applied to numeric columns. Each element is an instance of a pyspark.ml.feature transformer class.
Returns:

pipeline – an untrained pyspark.ml.Pipeline instance

pipeasy_spark.build_default_pipeline(dataframe, exclude_columns=())[source]

Build a simple (untrained) transformation pipeline for the given dataframe.

By default, numeric columns are processed with a StandardScaler and string columns with a StringIndexer followed by a OneHotEncoderEstimator.

Parameters:
  • dataframe (pyspark.sql.DataFrame) – only the schema of the dataframe is used, not the actual data.
  • exclude_columns (list of str) – names of the columns to which no transformation should be applied.
Returns:

pipeline – an untrained pyspark.ml.Pipeline instance