This deck covers data preparation, training, testing and validation for machine learning with Apache SystemML. It also covers descriptive statistics: univariate, bivariate and stratified statistics.
8. Signature of transform()
§ Invocation 1:
output = transform (target = input,
                    transformSpec = specification,
                    transformPath = "/path/to/metadata");
§ Resulting metadata: number of distinct values in categorical columns, list of distinct values with their recoded IDs, number of bins, bin width, etc.
§ An existing transformation can be applied to new data using the metadata generated in an earlier invocation
§ Invocation 2:
output = transform (target = input,
                    transformPath = "/path/to/new_metadata",
                    applyTransformPath = "/path/to/metadata");
11. Pre-Processing Training and Testing Data
Training phase:
Train = read ("/user/ml/trainset.csv");
Spec = read ("/user/ml/tf.spec.json", data_type = "scalar",
             value_type = "String");
trainD = transform (target = Train,
                    transformSpec = Spec,
                    transformPath = "/user/ml/train_tf_metadata");
# Build a predictive model using trainD
...
Testing phase:
Test = read ("/user/ml/testset.csv");
testD = transform (target = Test,
                   transformPath = "/user/ml/test_tf_metadata",
                   applyTransformPath = "/user/ml/train_tf_metadata");
# Test the model using testD
...
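The idea of building metadata on the training set and reusing it on the test set can be sketched in plain Python (not DML). The names `fit_recode` and `apply_recode` are hypothetical, and the choices made here (IDs assigned in order of first appearance, unseen values mapped to `None`) are assumptions for illustration, not a description of how SystemML behaves.

```python
def fit_recode(values):
    """Build recode metadata: each distinct value gets a 1-based integer ID.

    Assumption: IDs are assigned in order of first appearance.
    """
    recode_map = {}
    for v in values:
        if v not in recode_map:
            recode_map[v] = len(recode_map) + 1
    return recode_map


def apply_recode(values, recode_map):
    """Apply existing metadata; values unseen during fitting map to None."""
    return [recode_map.get(v) for v in values]


# Training phase: build the metadata and transform the training column.
train_col = ["red", "green", "red", "blue"]
metadata = fit_recode(train_col)
train_ids = apply_recode(train_col, metadata)

# Testing phase: reuse the training metadata instead of rebuilding it,
# so the same category always receives the same ID.
test_col = ["blue", "red", "yellow"]
test_ids = apply_recode(test_col, metadata)
```

Reusing the training metadata keeps the encoding of the test set consistent with what the model saw during training.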
12. Cross Validation
K-fold Cross Validation:
1. Shuffle the data points
2. Divide the data points into 𝑘 folds of (roughly) the same size
3. For 𝑖 = 1, … , 𝑘:
• Train each model on all the data points that do not belong to fold 𝑖
• Test each model on all the examples in fold 𝑖 and compute the test error
4. Select the model with the minimum average test error over all 𝑘 folds
5. (Train the winning model on all the data points)
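The steps above can be sketched in plain Python (not DML). The two candidate models, the squared-error metric and all function names are hypothetical choices made only for this illustration.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Steps 1-2: shuffle indices 0..n-1 and split them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(models, xs, ys, k=5, seed=0):
    """Step 3: average test error over k folds for each candidate model.

    `models` maps a name to a fit function: fit(train_x, train_y) -> predictor.
    """
    folds = k_fold_indices(len(xs), k, seed)
    avg_error = {}
    for name, fit in models.items():
        errors = []
        for i in range(k):
            test_idx = folds[i]
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            predict = fit([xs[j] for j in train_idx], [ys[j] for j in train_idx])
            # Mean squared error on the held-out fold i.
            errors.append(mean([(predict(xs[j]) - ys[j]) ** 2 for j in test_idx]))
        avg_error[name] = mean(errors)
    return avg_error


# Two toy candidate models (hypothetical, for illustration only).
def fit_mean(train_x, train_y):
    m = mean(train_y)
    return lambda x: m


def fit_line_through_origin(train_x, train_y):
    slope = sum(x * y for x, y in zip(train_x, train_y)) / sum(x * x for x in train_x)
    return lambda x: slope * x


models = {"mean": fit_mean, "line": fit_line_through_origin}
xs = [float(i) for i in range(1, 21)]
ys = [2.0 * x for x in xs]

scores = cross_validate(models, xs, ys, k=5)
best = min(scores, key=scores.get)   # Step 4: minimum average test error
final_model = models[best](xs, ys)   # Step 5: retrain the winner on all data
```

Every model is evaluated on the same folds, so the average errors are directly comparable.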
[Figure: example with 𝑘 = 5, each fold marked as Testing or Training]
14. Univariate Statistics
Row  Name of Statistic            Scale  Category
1    Minimum                      +
2    Maximum                      +
3    Range                        +
4    Mean                         +
5    Variance                     +
6    Standard deviation           +
7    Standard error of mean       +
8    Coefficient of variation     +
9    Skewness                     +
10   Kurtosis                     +
11   Standard error of skewness   +
12   Standard error of kurtosis   +
13   Median                       +
14   Interquartile mean           +
15   Number of categories                +
16   Mode                                +
17   Number of modes                     +
Groups of measures: central tendency measures, dispersion measures, shape measures, categorical measures
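Several of the statistics listed for this slide can be computed with Python's standard `statistics` module, as in the sketch below. It is plain Python, not DML, and the data is made up. The variance uses the sample (n − 1) convention, which is an assumption here; skewness, kurtosis and the interquartile mean are left out because their exact definitions vary between tools.

```python
import math
import statistics as st

# Scale (numeric) feature.
scale = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(scale)

minimum = min(scale)
maximum = max(scale)
value_range = maximum - minimum
mean = st.mean(scale)
variance = st.variance(scale)           # sample variance, divides by n - 1
std_dev = st.stdev(scale)               # square root of the sample variance
std_err_mean = std_dev / math.sqrt(n)   # standard error of the mean
coeff_variation = std_dev / mean        # coefficient of variation
median = st.median(scale)

# Categorical feature.
category = ["a", "b", "a", "c", "b", "a"]
num_categories = len(set(category))
modes = st.multimode(category)          # all most frequent values
num_modes = len(modes)
```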
20. Nominal-vs-Scale Statistics
𝐹 statistic
§ A measure of the strength of association between a categorical feature and a scale feature
§ Assumptions (𝑥 categorical, 𝑦 scale):
  § 𝑦 ~ 𝑁𝑜𝑟𝑚𝑎𝑙(𝜇, 𝜎²), with the same variance for all 𝑥
  § 𝑥 has a small value domain with large frequency counts; the 𝑥ᵢ are non-random
  § All records are iid
§ Under the independence assumption, 𝐹 is distributed approximately as 𝐹(𝑘 − 1, 𝑛 − 𝑘)
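$$F = \frac{\sum_{x} \mathrm{freq}(x)\,\bigl(\bar{y}(x) - \bar{y}\bigr)^2 \,/\, (k-1)}{\sum_{i=1}^{n} \bigl(y_i - \bar{y}(x_i)\bigr)^2 \,/\, (n-k)} = \frac{\eta^2\,(n-k)}{(1-\eta^2)\,(k-1)}$$
Numerator: ESS (Explained Sum of Squares) divided by its degrees of freedom, 𝑘 − 1
Denominator: RSS (Residual Sum of Squares) divided by its degrees of freedom, 𝑛 − 𝑘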
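The 𝐹 statistic for a categorical 𝑥 and a scale 𝑦 can be checked numerically with a short plain-Python sketch (not DML). The function name `f_statistic` and the data are made up for illustration, and 𝜂² is taken to be ESS / (ESS + RSS), which is the usual definition and is assumed here because the slide does not spell it out.

```python
from collections import defaultdict


def f_statistic(x, y):
    """One-way F statistic for categorical x and scale y, plus eta squared."""
    n = len(y)
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    k = len(groups)  # number of distinct categories

    y_bar = sum(y) / n
    group_mean = {g: sum(v) / len(v) for g, v in groups.items()}

    # ESS: between-category sum of squares, weighted by category frequency.
    ess = sum(len(v) * (group_mean[g] - y_bar) ** 2 for g, v in groups.items())
    # RSS: within-category sum of squares.
    rss = sum((yi - group_mean[xi]) ** 2 for xi, yi in zip(x, y))

    f = (ess / (k - 1)) / (rss / (n - k))
    eta_sq = ess / (ess + rss)  # assumed definition of eta squared
    return f, eta_sq, k, n


x = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
y = [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0]

f, eta_sq, k, n = f_statistic(x, y)
# Same value computed from eta squared, as in the second form of the formula.
f_from_eta = eta_sq * (n - k) / ((1 - eta_sq) * (k - 1))
```

For this data the category means are 2, 3 and 7 around an overall mean of 4, giving ESS = 42, RSS = 6 and 𝐹 = 21 with (2, 6) degrees of freedom; both forms of the formula agree.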