Errata List for "The Data Science Design Manual" by Steven Skiena
Last addition: May 6, 2018
Non-trivial errata are denoted with a (*).
=============================================================
Page 12, line 10: "FOA" should be "FOIA"
Page 15, line 11: "characters" may be better written as "options".
Page 16, line 10: (age and martial status) should be (age and marital status).
(*) Page 31, line -12: P(B) = 8/36, not 9/36. This error propagates through
the next two corrections.
(*) Page 31, line -11: Should be $P(A)=27/36$ and $P(B|A) = 8/27$.
(*) Page 31, line -1: Should be $P(B|A)= (1 \cdot 8/36)/(27/36) = 8/27$.
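The three corrected values can be checked by enumerating all 36 outcomes of two fair dice. The sketch below assumes, since the errata entries do not restate them, that A is the event that at least one die is even and B is the event that the sum is 7 or 11. These definitions are an assumption chosen to be consistent with the corrected figures.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

# Assumed event definitions (not restated in the errata entries):
# A = at least one die shows an even number, B = the sum is 7 or 11.
A = {o for o in outcomes if o[0] % 2 == 0 or o[1] % 2 == 0}
B = {o for o in outcomes if sum(o) in (7, 11)}

n = len(outcomes)
p_a = Fraction(len(A), n)                    # P(A)
p_b = Fraction(len(B), n)                    # P(B)
p_a_given_b = Fraction(len(A & B), len(B))   # P(A|B)

# Bayes' theorem: P(B|A) = P(A|B) * P(B) / P(A)
p_b_given_a = p_a_given_b * p_b / p_a
```

Under these assumed events, an odd sum forces one die to be even, which is why $P(A|B) = 1$ appears in the corrected last line.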
(*) Page 33, line -7: The equation for $P(k = X)$ should be:
$P(k=X) = C'(k) = C(X \leq k ) - C(X \leq k - \delta)$
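As a quick sanity check on the corrected equation, the sketch below recovers a probability mass function from a cumulative distribution by differencing. It assumes a discrete variable (a fair six-sided die, a hypothetical example) so that $\delta = 1$; the function names are illustrative only.

```python
from fractions import Fraction

def cdf(k):
    """C(X <= k) for a fair six-sided die (hypothetical example)."""
    return Fraction(min(max(k, 0), 6), 6)

def pmf_from_cdf(k, delta=1):
    """P(k = X) = C(X <= k) - C(X <= k - delta)."""
    return cdf(k) - cdf(k - delta)

# Each face of the die should come out with probability 1/6.
probs = [pmf_from_cdf(k) for k in range(1, 7)]
```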
Page 45, line 13: In the formula 1-r^2 = V(r)/V(y), the variable r is used for two different things: the first r (in r^2) gives r-squared, and the second (in V(r)) is the residual error.
Page 65: "The Freedom of Information Act (FOI) enables" should be "The Freedom of Information Act (FOIA) enables"
Page 82, line 12: "then wrote then wrote" should be "then wrote"
(*) Page 82, line 14: Delete the sentence:
"The median score among my students was closer than any single guess".
First it was meant to read "any other guess". But actually 2350 is closer.
(*) Page 101, line -9: "92.5th percentile" should be "97.7th percentile"
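The corrected figure matches the standard normal CDF evaluated two standard deviations above the mean. That the passage concerns a point at $+2\sigma$ is an assumption here, inferred from the value 97.7; the check below uses only the standard library.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Percentile of a point two standard deviations above the mean (assumed context).
percentile = 100.0 * normal_cdf(2.0)
```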
Page 119, problem 4-1: "Suppose we observe X = 5.08." X should be lower case italic x
Page 121, line -2: Baysian -> Bayesian
(*) Page 128, line -2, $\lambda$ should be $\mu$.
Page 132, Section 5.2 Line 2: "it is pays" should be "it pays"
(*) Page 132, last line, The equation for $P(k = X)$ should be:
$P(k=X) = C'(k) = C(X \leq k ) - C(X \leq k - \delta)$
(*) Page 137, line 15: The order of effect sizes is reversed: it should
be "small effects 85\% overlap, medium effects 67\% overlap, and large effects 53\% overlap".
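These percentages are consistent with Cohen's $U_1$ non-overlap measure at the conventional effect sizes $d = 0.2$, $0.5$, and $0.8$. That the book uses this particular overlap definition and these thresholds is an assumption; the sketch only shows that the corrected ordering and values are reproducible under it.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def overlap_percent(d):
    """Percent overlap of two unit-variance normals, taken as 1 - U1 (assumed definition)."""
    phi = normal_cdf(d / 2.0)
    u1 = (2.0 * phi - 1.0) / phi   # Cohen's U1: fraction of combined area not shared
    return 100.0 * (1.0 - u1)

# Conventional small / medium / large effect sizes (assumed thresholds).
overlaps = {name: round(overlap_percent(d))
            for name, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]}
```

Larger effects separate the two distributions more, so overlap must shrink as the effect size grows, which is the ordering the correction restores.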
(*) Page 144, line 26: (1-.0672) should be (1-.672)
Page 152, line -11: "person will weight over" should be "person will weigh over"
Page 200, problem 6-11: "Describe good practices in data visualization?"
should end with a period instead of a question mark.
Page 200, line -9: Problem 6-13: "How would you to determine" should be "How would you determine"
(*) Page 219, line 2: "false positive and false negative rates" should be "true positive and false positive rates"
(**) Page 219, line 3: The true positive rate is just recall, i.e., TP / (TP + FN). The false positive rate is FP / (FP + TN).
(**) Page 219, paragraphs 2 and 3: This is a mess, because I reversed left and right.
line 5: the sweep should go from right to left.
line 8: "very left" should be "very right"
line 10: "as far to the right" should be "as far to the left"
line 15: "sweep our threshhold to the right" should be "sweep our threshhold to the left"
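The corrected definitions and sweep direction for page 219 can be sketched as follows. This is a minimal illustration, not the book's code: the scores and labels are hypothetical, scores are assumed distinct, and higher scores are assumed to indicate the positive class, so the threshold starts at the far right (nothing classified positive) and moves left.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs as the threshold sweeps from high scores to low."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]   # threshold at the very right: nothing is called positive
    tp = fp = 0
    # Visit items from the highest score to the lowest (right to left).
    for score, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1         # item is now a true positive
        else:
            fp += 1         # item is now a false positive
        # TPR = TP / (TP + FN), FPR = FP / (FP + TN)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical classifier scores and 0/1 labels, for illustration only.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
```

Once the threshold reaches the far left, every item is classified positive, so the curve ends at (1, 1).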
Page 220, line -5: "Ours in a hard task" should be "Ours is a hard task"
(*) Page 221, line 3: "none are classified as 1800" should be "almost none (1\%) are classified as 1800"
Page 221, line 14: "Sparse" rows and columns in the confusion matrix refers to raw counts. Figure 7.7 has been normalized so the counts are indeterminable.
Page 225: "Observe the the evaluation" should be "Observe the evaluation"
Page 236, problem 7-16: "following types of betable events"
betable should be spelled bettable
Page 238, line 8: Add last plus to get
$y = c_0 + c_1 x_1 + c_2 x_2 + \cdots + c_{m-1} x_{m-1}$
(*) Page 239, line 6: The left variable of the equation should be $w$, not $c$.
(*) Page 239, line 7: "Ax = b" should be "Aw = b".
Page 243, Figure 8.4 caption: the transposition is in the center, not on the right as printed.
Page 243, line 1: An alternate definition of transpose might say "flips the diagonal from upper left to lower right".
(*) Page 244, line 12: Change 4 to 3 in lower-left entry of leftmost matrix:
$$ \Bigg(
\begin{bmatrix}
1 & 2 \\
3 & 4 \\
\end{bmatrix}
\begin{bmatrix}
1 & 0 \\
0 & 2 \\
\end{bmatrix}
\Bigg) $$
(*) Page 257, Section 8.5, line 3: $\lambda_i \geq \lambda_{i-1}$
(*) Page 259, Section 8.5.1, lines 15 and 16: In both the equation and the
line following, $B_k^T$ should be $B_k$.
Page 264 8-10 "Ax=b?" Should end with a period instead of question mark
(*) Page 271, line 20: In the right hand side of the equation for $w_1$,
the $(\sigma_x/\sigma_y)$ part should be flipped to $(\sigma_y/\sigma_x)$.
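A numeric check makes the direction of the ratio easy to confirm. The sketch below assumes the equation in question is the usual single-variable slope $w_1 = r_{xy} (\sigma_y/\sigma_x)$ and compares it with the least-squares slope computed directly; the data points are hypothetical.

```python
import math

def slope_two_ways(xs, ys):
    """Return (least-squares slope, r * sigma_y / sigma_x) for paired data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)                # Pearson correlation
    sigma_x, sigma_y = math.sqrt(sxx / n), math.sqrt(syy / n)
    return sxy / sxx, r * sigma_y / sigma_x

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
w1_least_squares, w1_from_r = slope_two_ways(xs, ys)
```

The two values agree, whereas using $\sigma_x/\sigma_y$ would give a slope with the wrong units (x per y rather than y per x).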
Page 275, line -4: between 0 and 12 + 4 + 5 = 19 years,
19 --> 21
(*) Page 283, lines 3,4: there are errors in the equation, which is twice missing partial derivative symbols.
2 --> ∂ ($\partial$)
(*) Page 285, line 9: there are errors in the equation, which is twice missing partial derivative symbols.
2 --> ∂ ($\partial$)
No partial derivative should have a 2 in the numerator: the 1/2 factor on the squared error term cancels it when differentiating.
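The cancellation can be confirmed numerically. The sketch below assumes the loss has the common form $J(w_0, w_1) = \frac{1}{2n} \sum_i (w_0 + w_1 x_i - y_i)^2$, which the errata entries do not restate, and compares the analytic partial derivatives (with no factor of 2) against central finite differences on hypothetical data.

```python
def loss(w0, w1, xs, ys):
    """J(w0, w1) = 1/(2n) * sum (w0 + w1*x - y)^2  (assumed form of the loss)."""
    n = len(xs)
    return sum((w0 + w1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)

def grad(w0, w1, xs, ys):
    """Analytic partials: the 2 from the power rule cancels the 1/2."""
    n = len(xs)
    residuals = [w0 + w1 * x - y for x, y in zip(xs, ys)]
    d_w0 = sum(residuals) / n
    d_w1 = sum(r * x for r, x in zip(residuals, xs)) / n
    return d_w0, d_w1

# Hypothetical data and parameter values, for illustration only.
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.1, 5.9]
w0, w1, h = 0.5, 1.5, 1e-6

# Central finite differences approximate each partial derivative.
num_d0 = (loss(w0 + h, w1, xs, ys) - loss(w0 - h, w1, xs, ys)) / (2 * h)
num_d1 = (loss(w0, w1 + h, xs, ys) - loss(w0, w1 - h, xs, ys)) / (2 * h)
g0, g1 = grad(w0, w1, xs, ys)
```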
(*) Page 293, line -3: punish it aggressively when $f(y_i) \rightarrow 0$,
$y_i$ --> $x_i$
(*) Page 294, line -1: there is an error in the summation symbol and square bracket.
[ \sum --> \sum [
(*) Page 308, line 25: 0.03 should be $0.03^2$
Page 319, line 6: LSH wants similar items to recieve the exact same hash code
recieve --> receive
Page 321, line 7: like eigenvalue or singular value decomposition (see Section 8.5) on the the
the the --> the (duplicate)
Page 323, line -4: people do not see that a distance or similarity matrix is really just a graph than can take advantage of other tools.
"than" -> "that"
Page 325, line 21: The Pagerank formula has a long dash in out-degree, should be \text{out-degree}
Page 328, line -5: records in a (say) a hundred distinct subsets each ordered by similarity
in a --> into
Page 328, line -2: more accurate on this restricted class of items then a general model trained over all items.
then --> than
Page 332, line -3: sometimes called the k-mediods algorithm
mediod --> medoid
Page 333, line 17: k-means can proceed by reading the distances off this matrix.
off --> off of
Page 336, Figure 10.17 -- this figure is much too low resolution and blurry. Sorry.
Page 340, line -11: point in each cluster (medioid) as a representative in the general case
medioid --> medoid
Page 340, line -4, "... with using the fastest algorithm ," extra space between 'algorithm' and ','
Page 341, line -12: where I asked asked how many clusters
asked asked --> asked (duplicate)
Page 346 Problem 10.11 (a): "Might is possible this classifier" should be "Might it be possible that this classifier"
Page 349 10-31: How can you ... of the data. Should end with a question mark instead of period
Page 354, first equation (line 13): $P(B)$ should be $p(B)$.
Page 360, line 4: $t \in (10.5, 11.5, 17)$ might be clearer as $t \in (10.5, 12.5, 17)$, since (11 + 14) / 2 = 12.5
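The suggested thresholds are the midpoints between consecutive sorted feature values. The sketch below assumes the underlying values are 10, 11, 14, and 20, which is an inference from the thresholds rather than something stated in this entry; the function name is illustrative.

```python
def candidate_thresholds(values):
    """Midpoints between consecutive distinct sorted values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

# Hypothetical feature values, chosen to be consistent with the thresholds above.
thresholds = candidate_thresholds([10, 11, 14, 20])
```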
(*) Page 369, line 6: repeated equations, one should be equal to -1 the other to 1
Page 371, line -6: "more that d steps" should be "more than d steps"
Page 373, paragraph 4, line 5: "... and F[i, j] reflects ... times work i ..." work -> word
Page 389 11-14 What are some ... from traditional machine learning
No punctuation at end, should have question mark
Page 419 line 7, Chapter notes: Rajarman -> Rajaraman