Use glm

glm fits a generalized linear model (GLM), similar to R's glm().

Syntax: glm(formula, data, family, ...)

Parameters:

  • formula: Symbolic description of the model to be fitted, for example: ResponseVariable ~ Predictor1 + Predictor2. Supported operators: ~, +, -, and .
  • data: Any SparkDataFrame
  • family: String, "gaussian" for linear regression or "binomial" for logistic regression
  • lambda: Numeric, regularization parameter
  • alpha: Numeric, elastic-net mixing parameter
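
As a rough illustration of how lambda and alpha interact, the sketch below computes an elastic-net penalty in plain R. It assumes the common form lambda * (alpha * L1 + (1 - alpha) / 2 * L2^2); the helper name elasticNetPenalty and the example weights are hypothetical and are not part of SparkR.

```r
# Hypothetical helper, not a SparkR function: elastic-net penalty for a weight
# vector, assuming the form lambda * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2).
elasticNetPenalty <- function(w, lambda, alpha) {
  l1 <- sum(abs(w))   # L1 norm of the weights
  l2sq <- sum(w^2)    # squared L2 norm of the weights
  lambda * (alpha * l1 + (1 - alpha) / 2 * l2sq)
}

w <- c(1, -2)  # made-up coefficients for illustration
ridgeOnly <- elasticNetPenalty(w, lambda = 0.1, alpha = 0)    # pure L2 penalty
lassoOnly <- elasticNetPenalty(w, lambda = 0.1, alpha = 1)    # pure L1 penalty
mixed     <- elasticNetPenalty(w, lambda = 0.1, alpha = 0.5)  # blend of both
print(c(ridgeOnly, lassoOnly, mixed))
```

Under this assumed form, alpha = 0 gives a ridge-style penalty, alpha = 1 gives a lasso-style penalty, and lambda scales the overall strength.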

Output: MLlib PipelineModel

This tutorial shows how to perform linear and logistic regression on the diamonds dataset.

Load diamonds data and split into training and test sets

require(SparkR)

# Read diamonds.csv dataset as SparkDataFrame
diamonds <- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
                  source = "com.databricks.spark.csv", header="true", inferSchema = "true")
diamonds <- withColumnRenamed(diamonds, "", "rowID")

# Split data into Training set and Test set
trainingData <- sample(diamonds, FALSE, 0.7)
testData <- except(diamonds, trainingData)

# Exclude rowIDs
trainingData <- trainingData[, -1]
testData <- testData[, -1]

print(count(diamonds))
print(count(trainingData))
print(count(testData))
head(trainingData)
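
The split above samples rows without replacement and then takes the remaining rows as the test set. A minimal local analogue in base R, on a made-up data frame (toy and its columns are illustrative only), might look like the sketch below. Note that it draws exactly 70% of the rows, whereas SparkR's sample() may return only approximately that fraction.

```r
# Base R sketch of a train/test split; toy data, not the diamonds dataset.
set.seed(42)  # make the random draw reproducible
toy <- data.frame(rowID = 1:10, price = seq(100, 1000, by = 100))

# Pick 70% of the row positions without replacement for training
trainIdx <- sample(nrow(toy), size = round(0.7 * nrow(toy)), replace = FALSE)
trainToy <- toy[trainIdx, ]
testToy  <- toy[-trainIdx, ]  # every row not chosen for training

print(nrow(trainToy))
print(nrow(testToy))
```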

Train a linear regression model using glm()

This section shows how to predict a diamond's price from its features by training a linear regression model on the training data.

The data contain a mix of categorical features (cut: Ideal, Premium, Very Good, ...) and continuous features (depth, carat). Under the hood, SparkR automatically one-hot encodes the categorical features, so you do not have to do it manually.
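
To make the idea of one-hot encoding concrete, here is a small base R sketch using model.matrix() on a made-up data frame. It only illustrates the concept; it is not how SparkR implements the encoding internally, and the exact columns SparkR produces may differ.

```r
# Base R illustration only; SparkR performs its own encoding internally.
toy <- data.frame(
  cut   = factor(c("Ideal", "Premium", "Very Good", "Ideal")),
  carat = c(0.23, 0.21, 0.29, 0.31)
)

# "- 1" removes the intercept so that every level of cut gets its own 0/1 column
encoded <- model.matrix(~ cut + carat - 1, data = toy)
print(encoded)
```

Each row has exactly one indicator column set to 1 for its cut level, while the continuous carat column passes through unchanged.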

# Family = "gaussian" to train a linear regression model
lrModel <- glm(price ~ ., data = trainingData, family = "gaussian")

# Print a summary of the trained model
summary(lrModel)

Use predict() on the test data to see how well the model works on new data.

Syntax: predict(model, newData)

Parameters:

  • model: MLlib model
  • newData: SparkDataFrame, typically your test set

Output: SparkDataFrame

# Generate predictions using the trained model
predictions <- predict(lrModel, newData = testData)

# View predictions against the price column
display(select(predictions, "price", "prediction"))

Evaluate the model.

errors <- select(predictions, predictions$price, predictions$prediction, alias(predictions$price - predictions$prediction, "error"))
display(errors)

# Calculate RMSE
head(select(errors, alias(sqrt(sum(errors$error^2 , na.rm = TRUE) / nrow(errors)), "RMSE")))
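
The RMSE above is the square root of the mean squared difference between actual and predicted prices. The same arithmetic on a few made-up numbers in base R (the values are illustrative, not taken from the diamonds data):

```r
# Base R sketch of the RMSE calculation on made-up values.
actual    <- c(326, 334, 423, 500)
predicted <- c(300, 350, 400, 520)

error <- actual - predicted   # per-row prediction error
rmse  <- sqrt(mean(error^2))  # root mean squared error
print(rmse)
```

RMSE is expressed in the same units as the response (here, price), so it can be read as a typical size of the prediction error.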

Train a logistic regression model using glm()

This section shows how to fit a logistic regression model on the same dataset to predict a diamond's cut from some of its features.

Logistic regression in MLlib supports only binary classification. To test the algorithm in this example, subset the data so that it contains only two labels.

# Subset data to include rows where diamond cut = "Premium" or diamond cut = "Very Good"
trainingDataSub <- subset(trainingData, trainingData$cut %in% c("Premium", "Very Good"))
testDataSub <- subset(testData, testData$cut %in% c("Premium", "Very Good"))
# Family = "binomial" to train a logistic regression model
logrModel <- glm(cut ~ price + color + clarity + depth, data = trainingDataSub, family = "binomial")

# Print summary of the trained model
summary(logrModel)
# Generate predictions using the trained model
predictionsLogR <- predict(logrModel, newData = testDataSub)

# View predictions against label column
display(select(predictionsLogR, "label", "prediction"))

Evaluate the model.

errorsLogR <- select(predictionsLogR, predictionsLogR$label, predictionsLogR$prediction, alias(abs(predictionsLogR$label - predictionsLogR$prediction), "error"))
display(errorsLogR)
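
Assuming label and prediction are both numeric 0/1 values, the error column is 0 for a correct prediction and 1 for a wrong one, so its mean is the misclassification rate. A base R sketch on made-up values:

```r
# Base R sketch; labels and predictions are made-up 0/1 values.
label      <- c(0, 1, 1, 0, 1)
prediction <- c(0, 1, 0, 0, 1)

error <- abs(label - prediction)  # 1 where the prediction is wrong, 0 otherwise
misclassificationRate <- mean(error)
accuracy <- 1 - misclassificationRate
print(c(misclassificationRate, accuracy))
```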