12 Julia Scientific Computing

Essential DataFrames for Scientific Computing

Basic operations on DataFrames

DataFrames.jl works much like pandas in Python, so anyone who has used pandas should find it easy to pick up.

DataFrame definition

Create a new DataFrame and add 4 columns:

using DataFrames
df1 = DataFrame()
df1[!, :clo1] = [1.0, 2.0, 3.0]
df1[!, :clo2] = [4.0, 5.0, 6.0]
df1[!, :clo3] = [7.0, 8.0, 9.0]
df1[!, :ID] = ['a', 'b', 'c']
show(df1)
>>3×4 DataFrame
│ Row │ clo1 │ clo2 │ clo3 │ ID │
│ │ Float64 │ Float64 │ Float64 │ Char │
├─────┼─────────┼─────────┼─────────┼─────┤
│ 1 │ 1.0 │ 4.0 │ 7.0 │'a' │
│ 2 │ 2.0 │ 5.0 │ 8.0 │'b' │
│ 3 │ 3.0 │ 6.0 │ 9.0 │'c' │

If no column names are specified, they default to x1, x2, ...

df2 = DataFrame(rand(5,6), :auto)

(Output: a 5×6 DataFrame with default column names x1, x2, …, x6.)

The contents can also be specified directly when the DataFrame is defined:

df3 = DataFrame([collect(1:3),collect(4:6)], [:A, :B])
>>  A      B
    Int64  Int64
 1  1      4
 2  2      5
 3  3      6
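The same table can also be built by passing the columns as keyword arguments, which many find more readable:

df3 = DataFrame(A = collect(1:3), B = collect(4:6))  # equivalent to the constructor above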

DataFrame merge

df2 = DataFrame()
df2[!, :clo1] = [11.0, 12.0, 13.0]
df2[!, :clo2] = [14.0, 15.0, 16.0]
df2[!, :clo3] = [17.0, 18.0, 19.0]
df2[!, :ID] = ['a', 'b', 'c']
show(df2)
>>3×4 DataFrame
│ Row │ clo1 │ clo2 │ clo3 │ ID │
│ │ Float64 │ Float64 │ Float64 │ Char │
├─────┼─────────┼─────────┼─────────┼─────┤
│ 1 │ 11.0 │ 14.0 │ 17.0 │'a' │
│ 2 │ 12.0 │ 15.0 │ 18.0 │'b' │
│ 3 │ 13.0 │ 16.0 │ 19.0 │'c' │

Combine df1 and df2 by joining on the ID column:

df3 = innerjoin(df1, df2, on=:ID, makeunique=true)  # makeunique renames the clashing clo1/clo2/clo3 columns from df2
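Other join kinds are available as well; a minimal sketch (assuming DataFrames ≥ 0.21, where join was split into innerjoin, leftjoin, outerjoin, ...):

leftjoin(df1, df2, on=:ID, makeunique=true)   # keep every row of df1
outerjoin(df1, df2, on=:ID, makeunique=true)  # keep rows from both sides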

Column rename

rename!(df1, :clo1 => :cool1)

Indexing

df = DataFrame(rand(5,6), :auto)
show(df)

5×6 DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼───────────┼──────────┼──────────┼──────────┼──────────┼───────────┤
│ 1 │ 0.678851 │ 0.514764 │ 0.829835 │ 0.996105 │ 0.427939 │ 0.163515 │
│ 2 │ 0.366487 │ 0.834486 │ 0.588859 │ 0.398066 │ 0.45738 │ 0.577453 │
│ 3 │ 0.0945619 │ 0.231903 │ 0.715664 │ 0.538696 │ 0.813534 │ 0.100683 │
│ 4 │ 0.102987 │ 0.269429 │ 0.292984 │ 0.719276 │ 0.756749 │ 0.705312 │
│ 5 │ 0.328692 │ 0.420872 │ 0.545552 │ 0.793303 │ 0.670403 │ 0.0619233 │

Use positional indexing to take columns 2 through 4:

df[:, 2:4]
>>  x2        x3        x4
    Float64   Float64   Float64
 1  0.514764  0.829835  0.996105
 2  0.834486  0.588859  0.398066
 3  0.231903  0.715664  0.538696
 4  0.269429  0.292984  0.719276
 5  0.420872  0.545552  0.793303

Access a column by name

df.x1
>>5-element Array{Float64,1}:
 0.6788506862994854 
 0.366486647281848  
 0.09456191069734388
 0.10298681965872736
 0.32869200293154477

Take rows 2 through 4 and columns x1 and x3:

df[2:4, [:x1, :x3]]
>>  x1         x3
    Float64    Float64
 1  0.366487   0.588859
 2  0.0945619  0.715664
 3  0.102987   0.292984

Iterating rows with a for loop

for row in eachrow(df)
    println(row)
end

View summary statistics for df

describe(df)

Sort

sort!(df, [:x1])

CSV file operations with DataFrames

using CSV
df = DataFrame(rand(5,6), :auto)
show(df)
>>5×6 DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼──────────┼──────────┼───────────┼──────────┼──────────┼───────────┤
│ 1 │ 0.173904 │ 0.981004 │ 0.997001 │ 0.742527 │ 0.816245 │ 0.950086 │
│ 2 │ 0.946748 │ 0.661713 │ 0.0734147 │ 0.181932 │ 0.175604 │ 0.0550761 │
│ 3 │ 0.638618 │ 0.30026 │ 0.926336 │ 0.427942 │ 0.803738 │ 0.481783 │
│ 4 │ 0.156693 │ 0.943436 │ 0.0614211 │ 0.35279 │ 0.0692527 │ 0.417888 │
│ 5 │ 0.351843 │ 0.64204 │ 0.934611 │ 0.910804 │ 0.715309 │ 0.3677 │

Write file

CSV.write("df123.csv",df)

Read file

data = CSV.read("df123.csv", DataFrame)

Macros for DataFrames (DataFramesMeta)

@where

using DataFramesMeta
@where(df, :x1 .> 0.2)

(Output: the rows of df where x1 > 0.2.)

@with

@with(df, :x2 .+1)
>>5-element Array{Float64,1}:
 1.9810036040861598
 1.6617132284630372
 1.3002596213068984
 1.9434355174941438
 1.6420398463078156

@select

@select(df, :x3)
>>  x3
    Float64
 1  0.997001
 2  0.0734147
 3  0.926336
 4  0.0614211
 5  0.934611

@select(df, p = 2 * :x1, :x2)
>>  p         x2
    Float64   Float64
 1  0.347808  0.981004
 2  1.8935    0.661713
 3  1.27724   0.30026
 4  0.313385  0.943436
 5  0.703686  0.64204

RDatasets

RDatasets.jl is a Julia package that bundles many well-known datasets useful for learning and experimentation, including the iris dataset.
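To browse what is bundled, RDatasets provides a listing function (a quick sketch):

RDatasets.datasets()            # DataFrame listing every bundled dataset
RDatasets.datasets("datasets")  # only those from R's built-in "datasets" package, which includes iris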

Introduction to the iris dataset

In the field of machine learning there are a large number of public datasets, and iris is one of the most important.

The iris dataset has a long history. It first appeared in the British statistician and biologist Ronald Fisher's 1936 paper "The use of multiple measurements in taxonomic problems", where it was used to introduce linear discriminant analysis. The dataset covers three species of iris: Iris setosa, Iris versicolour, and Iris virginica. Fifty samples were collected for each species, so the dataset contains 150 samples in total.

For each of the 150 samples, four features were measured:

  • sepal length
  • sepal width
  • petal length
  • petal width

All four features are measured in centimeters (cm).

Conventionally, m denotes the number of samples and n the number of features per sample.

using RDatasets
iris = dataset("datasets", "iris")
features = Matrix(iris[:, 1:4])
labels = string.(iris[:, 5])
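As a quick check of m and n for iris:

size(features)   # (150, 4): m = 150 samples, n = 4 features
unique(labels)   # the three species names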

Gadfly drawing

using Gadfly
p = plot(iris, x=:SepalLength, y=:SepalWidth, Geom.point)

Output picture to file

img = SVG("iris_plot.SVG", 10cm, 8cm)
draw(img, p)

Connect the points with lines

p = plot(iris, x=:SepalLength, y=:SepalWidth, Geom.point, Geom.line)

Color the points by Species

p = plot(iris, x=:SepalLength, y=:SepalWidth, Geom.point, color=:Species)

Add labels to the plot

p = plot(iris, x=:SepalLength, y=:SepalWidth, label=:Species, color=:Species, Geom.point, Geom.label)

Histogram

fig1a = plot(iris, x="SepalLength", y="SepalWidth", Geom.point)
fig1b = plot(iris, x="SepalWidth", Geom.bar)
fig1 = hstack(fig1a, fig1b)

Probability distributions

using Distributions

Generate a normal distribution model with a mean of 20 and a standard deviation of 2.0

n1 = Normal(20,2.0)

params(n1)
>>(20.0, 2.0)

fieldnames(typeof(n1))
>>(:μ, :σ)

The type of Normal is UnionAll

typeof(Normal)
>>UnionAll
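This is because Normal itself is a parametric type; only a concrete instantiation such as Normal{Float64} is a DataType:

typeof(Normal)           # UnionAll – Normal is parameterized by its element type
typeof(Normal{Float64})  # DataType – a concrete instantiation
typeof(n1)               # Normal{Float64} – the type of the model built above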

Generate random numbers that satisfy the n1 model

rand(n1,100)

Binomial distribution

b = Binomial(20, 0.8)
rand(b, 1000)
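As a sanity check (using the generic mean and var from Distributions), the theoretical mean n·p = 16 and variance n·p·(1−p) = 3.2 can be compared with a sample:

mean(b)              # 16.0, the theoretical mean
var(b)               # 3.2, the theoretical variance
mean(rand(b, 1000))  # sample mean of 1000 draws, close to 16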

Probability Density

n = Normal()
pdf(n, 0.4)
>>0.36827014030332333

pdf(n, 0) #The probability density of the standard normal distribution at x=0 is 0.3989...
>>0.3989422804014327

Cumulative distribution function

cdf(n, 0.4)
>>0.6554217416103242

Quantile

quantile.(n, [0.25, 0.5, 0.95])
>>3-element Array{Float64,1}:
 -0.6744897501960818
  0.0               
  1.6448536269514717
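quantile is the inverse of cdf, which is easy to verify with a round trip:

cdf(n, quantile(n, 0.95))   # ≈ 0.95
quantile(n, cdf(n, 0.4))    # ≈ 0.4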

Fit a Normal distribution to the random numbers rand(10) to estimate its parameters:

fit(Normal, rand(10))
>>Normal{Float64}(μ=0.41375573240875224, σ=0.25565343114046984)

StatsBase library: it also contains commonly used statistical functions

using StatsBase
a = collect(1:6)
b = collect(4:9)
countne(a,b) #Compare in order 1!=4 2!=5 ... 6!=9
>>6
a = [1,2,3,4,5]
b = [4,1,3,2,5]
counteq(a,b) # Compare the number of equal elements in two vectors in order
>>2
L1dist(a,b) # abs(a[1]-b[1]) + ... + abs(a[n]-b[n])
>>6.0

L2dist(a, b) # sqrt((a[1]-b[1])^2 + ... + (a[n]-b[n])^2)

meanad(a, b) # mean absolute deviation L1dist(a,b)/n
>>1.2
sample(a) # Sample data once from a
sample(a, 3) # Sample data 3 times from a and return a 1-dimensional Array
>>3-element Array{Int64,1}:
 3
 2
 3
a1 = [1, 10, 20, 30]
a2 = [100, 200, 300]
sample!(a1, a2) # Sample length(a2) elements from a1 and write them into a2 (element type follows a2)
>>3-element Array{Int64,1}:
 20
 10
 30
a1 = [1, 10, 20, 30]
a2 = [100.0, 200.0, 300.0]
sample!(a1, a2)
>>3-element Array{Float64,1}:
  1.0
 30.0
  1.0

Other commonly used functions

x = [1.0, 2.0, 3.0, 4.0]
autocov(x)             # autocovariance of x at the default lags
autocov(x, [2,3])      # autocovariance at lags 2 and 3
autocor(x)             # autocorrelation of x at the default lags
autocor(x, [2,3])      # autocorrelation at lags 2 and 3
y = [3., 4., 5., 6.]
crosscov(x, y)         # cross-covariance between x and y
crosscov(x, y, [2,3])  # cross-covariance at lags 2 and 3

Time series

using TimeSeries, Dates

Create a range of dates

dates = collect(Date(2018, 1, 1):Day(1):Date(2018, 12, 31))
dates[1:20]

New TimeArray

ta = TimeArray(dates, rand(length(dates),2))
>>365×2 TimeArray{Float64,2,Date,Array{Float64,2}} 2018-01-01 to 2018-12-31
│ │ A │ B │
├────────────┼────────┼────────┤
│ 2018-01-01 │ 0.7933 │ 0.4359 │
│ 2018-01-02 │ 0.6347 │ 0.7234 │
│ 2018-01-03 │ 0.2454 │ 0.6826 │
│ 2018-01-04 │ 0.767 │ 0.9007 │
│ 2018-01-05 │ 0.1884 │ 0.5429 │
│ 2018-01-06 │ 0.7809 │ 0.2292 │
│ 2018-01-07 │ 0.9329 │ 0.0098 │
│ 2018-01-08 │ 0.0032 │ 0.7566 │
│ 2018-01-09 │ 0.1355 │ 0.8986 │
│ 2018-01-10 │ 0.9274 │ 0.6022 │
│ 2018-01-11 │ 0.6483 │ 0.53 │
│ 2018-01-12 │ 0.8594 │ 0.0237 │
   ⋮
│ 2018-12-21 │ 0.7245 │ 0.5286 │
│ 2018-12-22 │ 0.9374 │ 0.8508 │
│ 2018-12-23 │ 0.0528 │ 0.1361 │
│ 2018-12-24 │ 0.1184 │ 0.6156 │
│ 2018-12-25 │ 0.3645 │ 0.7896 │
│ 2018-12-26 │ 0.023 │ 0.8259 │
│ 2018-12-27 │ 0.0942 │ 0.9326 │
│ 2018-12-28 │ 0.6124 │ 0.3102 │
│ 2018-12-29 │ 0.5602 │ 0.3305 │
│ 2018-12-30 │ 0.273 │ 0.7611 │
│ 2018-12-31 │ 0.7595 │ 0.1093 │

Extract the timestamps

timestamp(ta)

Extract the values

values(ta)

Get the column names

colnames(ta)

Machine learning

MLBase

using MLBase

MLBase does not contain any machine learning algorithms itself; instead it provides many of the tools needed for machine learning, such as cross-validation.
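For example, one of its cross-validation helpers is Kfold; a minimal sketch of its interface (the indices are randomized, so the exact output varies):

using MLBase
for train_inds in Kfold(10, 3)            # 3 folds over 10 samples
    test_inds = setdiff(1:10, train_inds) # the held-out fold is the complement
    println(train_inds, " | ", test_inds)
end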

Let's first look at a few simple data processing functions in MLBase

repeach(1:3, 2)
>>6-element Array{Int64,1}:
 1
 1
 2
 2
 3
 3

repeach(["a", "b", "c"], 2)
>>6-element Array{String,1}:
 "a"
 "a"
 "b"
 "b"
 "c"
 "c"

repeach(["a", "b", "c"], [1, 2, 3])
>>6-element Array{String,1}:
 "a"
 "b"
 "b"
 "c"
 "c"
 "c"
A = reshape(collect(1:6), 2, 3)
repeachcol(A, 2)
>>2×6 Array{Int64,2}:
 1 1 3 3 5 5
 2 2 4 4 6 6
repeachrow(A, 2)
>>4×3 Array{Int64,2}:
 1 3 5
 1 3 5
 2 4 6
 2 4 6
labels = labelmap(['a','b','b','c','c','c'])
>>LabelMap (with 3 labels):
[1] a
[2] b
[3] c
labelencode(labels, ['b','c','c','a'])
>>4-element Array{Int64,1}:
 2
 3
 3
 1
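The inverse operation, labeldecode, maps the integer codes back to the original labels:

labeldecode(labels, [2, 3, 3, 1])   # ['b', 'c', 'c', 'a']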

Decision tree

Using the iris dataset

using RDatasets
iris = dataset("datasets", "iris")
features = Matrix(iris[:, 1:4])
labels = string.(iris[:, 5])

Load the DecisionTree package

using DecisionTree

This library contains:

  • Decision Tree Classifier
  • Random Forest Classifier
  • Adaptive-Boosted Decision Stumps Classifier
  • Regression Tree
  • Regression Random Forest

Here we only introduce the classification models, i.e. the first three algorithms in the list.

model = build_tree(labels, features)
>>Decision Tree
Leaves: 9
Depth: 5
model = prune_tree(model, 0.9)
>>Decision Tree
Leaves: 8
Depth: 5
print_tree(model, 5)
>>Feature 3, Threshold 2.45
L-> setosa: 50/50
R-> Feature 4, Threshold 1.75
    L-> Feature 3, Threshold 4.95
        L-> versicolor: 47/48
        R-> Feature 4, Threshold 1.55
            L-> virginica: 3/3
            R-> Feature 3, Threshold 5.449999999999999
                L-> versicolor: 2/2
                R-> virginica: 1/1
    R-> Feature 3, Threshold 4.85
        L-> Feature 2, Threshold 3.1
            L-> virginica: 2/2
            R-> versicolor: 1/1
        R-> virginica: 43/43
print_tree(model, 3)
>>Feature 3, Threshold 2.45
L-> setosa: 50/50
R-> Feature 4, Threshold 1.75
    L-> Feature 3, Threshold 4.95
        L-> versicolor: 47/48
        R-> 
    R-> Feature 3, Threshold 4.85
        L-> 
        R-> virginica: 43/43

Use the decision tree model to make predictions:

apply_tree(model, [5.9, 3.0, 5.1, 1.9])
>>"virginica"
apply_tree(model, [1.9, 2.0, 2.1, 1.9])
>>"setosa"

N-fold cross-validation for the pruned tree

n_folds=3
accuracy = nfoldCV_tree(labels, features, n_folds, 0.9)
>>3×3 Array{Int64,2}:
 24 0 0
  0 13 0
  0 0 13
3×3 Array{Int64,2}:
 12 0 0
  0 17 0
  0 0 21
3×3 Array{Int64,2}:
 14 0 0
  0 20 0
  0 1 15

Fold 1
Classes: ["setosa", "versicolor", "virginica"]
Matrix:   
Accuracy: 1.0
Kappa: 1.0

Fold 2
Classes: ["setosa", "versicolor", "virginica"]
Matrix:   
Accuracy: 1.0
Kappa: 1.0

Fold 3
Classes: ["setosa", "versicolor", "virginica"]
Matrix:   
Accuracy: 0.98
Kappa: 0.9695863746958637

Mean Accuracy: 0.9933333333333333
3-element Array{Float64,1}:
 1.0 
 1.0 
 0.98

Adaptive boosting

Construct adaboost_stumps model

model, coeffs = build_adaboost_stumps(labels, features, 10)
>>(Ensemble of Decision Trees
Trees: 10
Avg Leaves: 2.0
Avg Depth: 1.0, [0.346574, 0.255413, 0.264577, 0.335749, 0.288846, 0.258287, 0.356882, 0.298766, 0.242485, 0.349976])

Apply the model to make a prediction

apply_adaboost_stumps(model, coeffs, [5.9, 3.0, 5.1, 1.9])
>>"virginica"

adaboost nfold cross-validation

# 3 folds, 8 iterations
accuracy = nfoldCV_stumps(labels, features, 3, 8)

Random forest

Construct a random forest model

# 2 random features, 10 trees, 0.5 portion of samples
model = build_forest(labels, features, 2, 10, 0.5)
>>Ensemble of Decision Trees
Trees: 10
Avg Leaves: 6.4
Avg Depth: 4.4

Apply the model to make a prediction

apply_forest(model, [5.9, 3.0, 5.1, 1.9])
>>"virginica"

N-fold cross-validation for the random forest

accuracy = nfoldCV_forest(labels, features, 3, 2)
>>3×3 Array{Int64,2}:
 12 0 0
  0 15 0
  0 1 22
3×3 Array{Int64,2}:
 18 0 0
  0 13 2
  0 1 16
3×3 Array{Int64,2}:
 20 0 0
  0 20 0
  0 0 10

Fold 1
Classes: ["setosa", "versicolor", "virginica"]
Matrix:   
Accuracy: 0.98
Kappa: 0.9689440993788819

Fold 2
Classes: ["setosa", "versicolor", "virginica"]
Matrix:   
Accuracy: 0.94
Kappa: 0.9096385542168673

Fold 3
Classes: ["setosa", "versicolor", "virginica"]
Matrix:   
Accuracy: 1.0
Kappa: 1.0

Mean Accuracy: 0.9733333333333333
3-element Array{Float64,1}:
 0.98
 0.94
 1.0 

MultivariateStats

PCA (Principal Component Analysis) is a commonly used data analysis method. Through a linear transformation, PCA converts the original data into a set of linearly independent components, which can be used to extract the main features of the data; it is often used for dimensionality reduction of high-dimensional data.

using MultivariateStats, RDatasets 

# load iris dataset
iris = dataset("datasets", "iris")

# split half to training set
Xtr = Matrix(iris[1:2:end, 1:4])'
Xtr_labels = string.(iris[1:2:end, 5])

# split other half to testing set
Xte = Matrix(iris[2:2:end, 1:4])'
Xte_labels = string.(iris[2:2:end, 5])

# suppose Xtr and Xte are training and testing data matrix,
# with each observation in a column

# train a PCA model, allowing up to 3 dimensions
M = fit(PCA, Xtr; maxoutdim=3)

# apply PCA model to testing set
Yte = transform(M, Xte)

# reconstruct testing observations (approximately)
Xr = reconstruct(M, Yte)

# group results by testing set labels for color coding
setosa = Yte[:,Xte_labels.=="setosa"]
versicolor = Yte[:,Xte_labels.=="versicolor"]
virginica = Yte[:,Xte_labels.=="virginica"]
size(Xte)
>>(4, 75)
size(Yte)
>>(3, 75)
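To see how much of the total variance the reduced representation keeps, the fitted PCA model can be queried (a small sketch using MultivariateStats' accessors):

principalvars(M)    # variance captured by each retained component
principalratio(M)   # fraction of the total variance explained by the retained components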

Draw the data after dimensionality reduction

using Plots
p = scatter(setosa[1,:],setosa[2,:],setosa[3,:],marker=:circle,linewidth=0)
scatter!(versicolor[1,:],versicolor[2,:],versicolor[3,:],marker=:circle,linewidth=0)
scatter!(virginica[1,:],virginica[2,:],virginica[3,:],marker=:circle,linewidth=0)

SVM

using LIBSVM
using RDatasets
iris = dataset("datasets", "iris")
features = Matrix(iris[:, 1:4])
labels = string.(iris[:, 5])
features_train, features_test = features[1:2:end, :], features[2:2:end, :]
labels_train, labels_test = labels[1:2:end], labels[2:2:end]
model = svmtrain(features_train', labels_train)
(predicted_labels, decision_values) = svmpredict(model, features_test')

Check the prediction results

using Statistics
mean(predicted_labels .== labels_test) * 1.0
predicted_labels .== labels_test
>>75-element BitArray{1}:
 true
 true
 true
 true
 true
 true
 true
 true
 true
 true
 true
 true
 true
    ⋮
 true
 true
 true
 true
 true
 true
 true
 true
 true
 true
 true
 true

Linear regression

using GLM, GLMNet, DataFrames

lm model

data = DataFrame(X=[1,2,3], Y=[2,4,7])
ols = lm(@formula(Y ~ X), data)
>>StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: Y ~ 1 + X

Coefficients:
              Estimate   Std.Error  t value   Pr(>|t|)
(Intercept)  -0.666667   0.62361    -1.06904  0.4788
X             2.5        0.288675    8.66025  0.0732

Standard errors of the linear regression coefficients

stderror(ols)
>>2-element Array{Float64,1}:
 0.6236095644623245
 0.2886751345948133

Applying the model

newX = DataFrame(X=[2,3,4]);
GLM.predict(ols, newX)
>>3-element Array{Union{Missing, Float64},1}:
 4.333333333333332
 6.833333333333335
 9.33333333333334 
data = DataFrame(X=[1,2,3], Y=[2,4,6])
ols = lm(@formula(Y ~ X), data)
newX = DataFrame(X=[2,3,4])
GLM.predict(ols, newX)
>>3-element Array{Union{Missing, Float64},1}:
 4.0
 6.0
 8.0

glm model

data = DataFrame(X=[0,1,2,3,4], Y=[0.1296,0.0864,0.0576,0.0384,0.0256])
probit = glm(@formula(Y ~ X), data, Binomial(), ProbitLink())
>>StatsModels.DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Binomial{Float64},ProbitLink},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: Y ~ 1 + X

Coefficients:
              Estimate   Std.Error  z value    Pr(>|z|)
(Intercept)  -1.14184    1.32732    -0.860259  0.3896
X            -0.208282   0.659347   -0.315891  0.7521
newX = DataFrame(X=[1,2,3,4])
GLM.predict(probit, newX)
>>4-element Array{Union{Missing, Float64},1}:
 0.08848879660655808 
 0.05956904945792997 
 0.03864063839150974 
 0.024136051585243873

Linear regression on the iris dataset

data = DataFrame();
data[!, :x] = iris[1:50, :SepalWidth]
data[!, :y] = iris[1:50, :SepalLength]
model = lm(@formula(y ~ x), data)
>>StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: y ~ 1 + x

Coefficients:
             Estimate  Std.Error  t value  Pr(>|t|)
(Intercept)  2.639     0.310014   8.51251  <1e-10
x            0.69049   0.0898989  7.68074  <1e-9

Compare the LM fitted curve with the original curve

using Plots
plot(GLM.predict(model))
plot!(data.y)

Process iris dataset with GLM

model = glm(@formula(y~x), data, Normal(), IdentityLink())
>>StatsModels.DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: y ~ 1 + x

Coefficients:
             Estimate  Std.Error  z value  Pr(>|z|)
(Intercept)  2.639     0.310014   8.51251  <1e-16
x            0.69049   0.0898989  7.68074  <1e-13

Compare the curve fitted by GLM with the original curve

using Plots
plot(GLM.predict(model))
plot!(data.y)
# Estimates for coefficients
coef(model)
# Standard errors of coefficients
stderror(model)
# Covariance of coefficients
vcov(model)
# RSS (residual sum of squares / deviance)
deviance(model)

K-means

using Clustering
using RDatasets
iris = dataset("datasets", "iris")
features = Matrix(iris[:, 1:4])
labels = string.(iris[:, 5]);
# choose 3 random starting points
initseeds(:rand, Matrix(features'), 3)
>>3-element Array{Int64,1}:
  48
  80
 114
result = kmeans(Matrix(features'), 3)   # kmeans expects a d×n matrix with observations as columns
using Gadfly
plot(iris,x="PetalLength",y="PetalWidth",color=result.assignments, Geom.point)
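Clustering.jl also provides accessors for inspecting the result (a brief sketch):

assignments(result)   # cluster index assigned to each of the 150 samples
counts(result)        # number of samples in each of the 3 clusters
result.centers        # 4×3 matrix whose columns are the cluster centers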
Reference: https://cloud.tencent.com/developer/article/1652991