dataset-opencompass/data/mmlu/test/machine_learning_test.csv
2025-07-18 07:25:44 +00:00

31 KiB

"Statement 1| Linear regression estimator has the smallest variance among all unbiased estimators. Statement 2| The coefficients α assigned to the classifiers assembled by AdaBoost are always non-negative.","True, True","False, False","True, False","False, True",D
"Statement 1| RoBERTa pretrains on a corpus that is approximately 10x larger than the corpus BERT pretrained on. Statement 2| ResNeXts in 2018 usually used tanh activation functions.","True, True","False, False","True, False","False, True",C
"Statement 1| Support vector machines, like logistic regression models, give a probability distribution over the possible labels given an input example. Statement 2| We would expect the support vectors to remain the same in general as we move from a linear kernel to higher order polynomial kernels.","True, True","False, False","True, False","False, True",B
"A machine learning problem involves four attributes plus a class. The attributes have 3, 2, 2, and 2 possible values each. The class has 3 possible values. How many maximum possible different examples are there?","12","24","48","72",D
"As of 2020, which architecture is best for classifying high-resolution images?","convolutional networks","graph networks","fully connected networks","RBF networks",A
"Statement 1| The log-likelihood of the data will always increase through successive iterations of the expectation maximization algorithm. Statement 2| One disadvantage of Q-learning is that it can only be used when the learner has prior knowledge of how its actions affect its environment.","True, True","False, False","True, False","False, True",B
"Let us say that we have computed the gradient of our cost function and stored it in a vector g. What is the cost of one gradient descent update given the gradient?","O(D)","O(N)","O(ND)","O(ND^2)",A
"Statement 1| For a continuous random variable x and its probability distribution function p(x), it holds that 0 ≤ p(x) ≤ 1 for all x. Statement 2| Decision tree is learned by minimizing information gain.","True, True","False, False","True, False","False, True",B
"Consider the Bayesian network given below. How many independent parameters are needed for this Bayesian Network H -> U <- P <- W?","2","4","8","16",C
"As the number of training examples goes to infinity, your model trained on that data will have:","Lower variance","Higher variance","Same variance","None of the above",A
"Statement 1| The set of all rectangles in the 2D plane (which includes non-axis-aligned rectangles) can shatter a set of 5 points. Statement 2| The VC-dimension of k-Nearest Neighbour classifier when k = 1 is infinite.","True, True","False, False","True, False","False, True",A
"_ refers to a model that can neither model the training data nor generalize to new data.","good fitting","overfitting","underfitting","all of the above",C
"Statement 1| The F1 score can be especially useful for datasets with high class imbalance. Statement 2| The area under the ROC curve is one of the main metrics used to assess anomaly detectors.","True, True","False, False","True, False","False, True",A
"Statement 1| The back-propagation algorithm learns a globally optimal neural network with hidden layers. Statement 2| The VC dimension of a line should be at most 2, since I can find at least one case of 3 points that cannot be shattered by any line.","True, True","False, False","True, False","False, True",B
"High entropy means that the partitions in classification are","pure","not pure","useful","useless",B
"Statement 1| Layer Normalization is used in the original ResNet paper, not Batch Normalization. Statement 2| DCGANs use self-attention to stabilize training.","True, True","False, False","True, False","False, True",B
"In building a linear regression model for a particular data set, you observe the coefficient of one of the features having a relatively high negative value. This suggests that","This feature has a strong effect on the model (should be retained)","This feature does not have a strong effect on the model (should be ignored)","It is not possible to comment on the importance of this feature without additional information","Nothing can be determined.",C
"For a neural network, which one of these structural assumptions is the one that most affects the trade-off between underfitting (i.e. a high bias model) and overfitting (i.e. a high variance model):","The number of hidden nodes","The learning rate","The initial choice of weights","The use of a constant-term unit input",A
"For polynomial regression, which one of these structural assumptions is the one that most affects the trade-off between underfitting and overfitting:","The polynomial degree","Whether we learn the weights by matrix inversion or gradient descent","The assumed variance of the Gaussian noise","The use of a constant-term unit input",A
"Statement 1| As of 2020, some models attain greater than 98% accuracy on CIFAR-10. Statement 2| The original ResNets were not optimized with the Adam optimizer.","True, True","False, False","True, False","False, True",A
"The K-means algorithm:","Requires the dimension of the feature space to be no bigger than the number of samples","Has the smallest value of the objective function when K = 1","Minimizes the within class variance for a given number of clusters","Converges to the global optimum if and only if the initial means are chosen as some of the samples themselves",C
"Statement 1| VGGNets have convolutional kernels of smaller width and height than AlexNet's first-layer kernels. Statement 2| Data-dependent weight initialization procedures were introduced before Batch Normalization.","True, True","False, False","True, False","False, True",A
"What is the rank of the following matrix? A = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]","0","1","2","3",B
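The rank question above can be checked mechanically. The following is a minimal sketch, not part of the dataset: the helper name `matrix_rank` is hypothetical, and it uses plain Gaussian elimination over exact fractions so that no third-party library is assumed.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Return the rank of a matrix using Gaussian elimination on exact rationals."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Look for a row at or below position `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every row below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


A = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
print(matrix_rank(A))
```

Every row of `A` is a copy of the first, so elimination leaves a single non-zero row, which matches answer B.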
"Statement 1| Density estimation (using say, the kernel density estimator) can be used to perform classification. Statement 2| The correspondence between logistic regression and Gaussian Naive Bayes (with identity class covariances) means that there is a one-to-one correspondence between the parameters of the two classifiers.","True, True","False, False","True, False","False, True",C
"Suppose we would like to perform clustering on spatial data such as the geometrical locations of houses. We wish to produce clusters of many different sizes and shapes. Which of the following methods is the most appropriate?","Decision Trees","Density-based clustering","Model-based clustering","K-means clustering",B
"Statement 1| In AdaBoost weights of the misclassified examples go up by the same multiplicative factor. Statement 2| In AdaBoost, weighted training error e_t of the tth weak classifier on training data with weights D_t tends to increase as a function of t.","True, True","False, False","True, False","False, True",A
"MLE estimates are often undesirable because","they are biased","they have high variance","they are not consistent estimators","None of the above",B
"Computational complexity of Gradient descent is,","linear in D","linear in N","polynomial in D","dependent on the number of iterations",C
"Averaging the output of multiple decision trees helps _.","Increase bias","Decrease bias","Increase variance","Decrease variance",D
"The model obtained by applying linear regression on the identified subset of features may differ from the model obtained at the end of the process of identifying the subset during","Best-subset selection","Forward stepwise selection","Forward stage wise selection","All of the above",C
"Neural networks:","Optimize a convex objective function","Can only be trained with stochastic gradient descent","Can use a mix of different activation functions","None of the above",C
"Say the incidence of a disease D is about 5 cases per 100 people (i.e., P(D) = 0.05). Let Boolean random variable D mean a patient “has disease D” and let Boolean random variable TP stand for ""tests positive."" Tests for disease D are known to be very accurate in the sense that the probability of testing positive when you have the disease is 0.99, and the probability of testing negative when you do not have the disease is 0.97. What is P(TP), the prior probability of testing positive?","0.0368","0.473","0.078","None of the above",C
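The arithmetic behind P(TP) is the law of total probability. The sketch below is illustrative only; the variable names are assumptions, and the numbers are the ones stated in the question.

```python
# Quantities stated in the question.
p_d = 0.05            # P(D): prior probability of having the disease
sensitivity = 0.99    # P(TP | D)
specificity = 0.97    # P(not TP | not D)

# Law of total probability: P(TP) = P(TP | D) P(D) + P(TP | not D) P(not D)
p_tp = sensitivity * p_d + (1 - specificity) * (1 - p_d)
print(round(p_tp, 4))
```

The two terms are 0.0495 (true positives) and 0.0285 (false positives), which sum to 0.078, matching answer C.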
"Statement 1| After being mapped into feature space Q through a radial basis kernel function, 1-NN using unweighted Euclidean distance may be able to achieve better classification performance than in original space (though we can't guarantee this). Statement 2| The VC dimension of a Perceptron is smaller than the VC dimension of a simple linear SVM.","True, True","False, False","True, False","False, True",B
"The disadvantage of Grid search is","It can not be applied to non-differentiable functions.","It can not be applied to non-continuous functions.","It is hard to implement.","It runs reasonably slow for multiple linear regression.",D
"Predicting the amount of rainfall in a region based on various cues is a ______ problem.","Supervised learning","Unsupervised learning","Clustering","None of the above",A
"Which of the following sentences is FALSE regarding regression?","It relates inputs to outputs.","It is used for prediction.","It may be used for interpretation.","It discovers causal relationships",D
"Which one of the following is the main reason for pruning a Decision Tree?","To save computing time during testing","To save space for storing the Decision Tree","To make the training set error smaller","To avoid overfitting the training set",D
"Statement 1| The kernel density estimator is equivalent to performing kernel regression with the value Yi = 1/n at each point Xi in the original data set. Statement 2| The depth of a learned decision tree can be larger than the number of training examples used to create the tree.","True, True","False, False","True, False","False, True",B
"Suppose your model is overfitting. Which of the following is NOT a valid way to try and reduce the overfitting?","Increase the amount of training data.","Improve the optimisation algorithm being used for error minimisation.","Decrease the model complexity.","Reduce the noise in the training data.",B
"Statement 1| The softmax function is commonly used in multiclass logistic regression. Statement 2| The temperature of a nonuniform softmax distribution affects its entropy.","True, True","False, False","True, False","False, True",A
"Which of the following is/are true regarding an SVM?","For two dimensional data points, the separating hyperplane learnt by a linear SVM will be a straight line.","In theory, a Gaussian kernel SVM cannot model any complex separating hyperplane.","For every kernel function used in a SVM, one can obtain an equivalent closed form basis expansion.","Overfitting in an SVM is not a function of number of support vectors.",A
"Which of the following is the joint probability of H, U, P, and W described by the given Bayesian Network H -> U <- P <- W? [note: as the product of the conditional probabilities]","P(H, U, P, W) = P(H) * P(W) * P(P) * P(U)","P(H, U, P, W) = P(H) * P(W) * P(P | W) * P(W | H, P)","P(H, U, P, W) = P(H) * P(W) * P(P | W) * P(U | H, P)","None of the above",C
"Statement 1| Since the VC dimension for an SVM with a Radial Basis Kernel is infinite, such an SVM must be worse than an SVM with polynomial kernel which has a finite VC dimension. Statement 2| A two layer neural network with linear activation functions is essentially a weighted combination of linear separators, trained on a given dataset; the boosting algorithm built on linear separators also finds a combination of linear separators, therefore these two algorithms will give the same result.","True, True","False, False","True, False","False, True",B
"Statement 1| The ID3 algorithm is guaranteed to find the optimal decision tree. Statement 2| Consider a continuous probability distribution with density f() that is nonzero everywhere. The probability of a value x is equal to f(x).","True, True","False, False","True, False","False, True",B
"Given a Neural Net with N input nodes, no hidden layers, one output node, with Entropy Loss and Sigmoid Activation Functions, which of the following algorithms (with the proper hyper-parameters and initialization) can be used to find the global optimum?","Stochastic Gradient Descent","Mini-Batch Gradient Descent","Batch Gradient Descent","All of the above",D
"Adding more basis functions in a linear model, pick the most probable option:","Decreases model bias","Decreases estimation bias","Decreases variance","Doesn't affect bias and variance",A
"Consider the Bayesian network given below. How many independent parameters would we need if we made no assumptions about independence or conditional independence H -> U <- P <- W?","3","4","7","15",D
"Another term for out-of-distribution detection is?","anomaly detection","one-class detection","train-test mismatch robustness","background detection",A
"Statement 1| We learn a classifier f by boosting weak learners h. The functional form of f's decision boundary is the same as h's, but with different parameters. (e.g., if h was a linear classifier, then f is also a linear classifier). Statement 2| Cross validation can be used to select the number of iterations in boosting; this procedure may help reduce overfitting.","True, True","False, False","True, False","False, True",D
"Statement 1| Highway networks were introduced after ResNets and eschew max pooling in favor of convolutions. Statement 2| DenseNets usually cost more memory than ResNets.","True, True","False, False","True, False","False, True",D
"If N is the number of instances in the training dataset, nearest neighbors has a classification run time of","O(1)","O( N )","O(log N )","O( N^2 )",B
"Statement 1| The original ResNets and Transformers are feedforward neural networks. Statement 2| The original Transformers use self-attention, but the original ResNet does not.","True, True","False, False","True, False","False, True",A
"Statement 1| ReLUs are not monotonic, but sigmoids are monotonic. Statement 2| Neural networks trained with gradient descent with high probability converge to the global optimum.","True, True","False, False","True, False","False, True",D
"The numerical output of a sigmoid node in a neural network:","Is unbounded, encompassing all real numbers.","Is unbounded, encompassing all integers.","Is bounded between 0 and 1.","Is bounded between -1 and 1.",C
"Which of the following can only be used when training data are linearly separable?","Linear hard-margin SVM.","Linear Logistic Regression.","Linear Soft margin SVM.","The centroid method.",A
"Which of the following are the spatial clustering algorithms?","Partitioning based clustering","K-means clustering","Grid based clustering","All of the above",D
"Statement 1| The maximum margin decision boundaries that support vector machines construct have the lowest generalization error among all linear classifiers. Statement 2| Any decision boundary that we get from a generative model with class-conditional Gaussian distributions could in principle be reproduced with an SVM and a polynomial kernel of degree less than or equal to three.","True, True","False, False","True, False","False, True",D
"Statement 1| L2 regularization of linear models tends to make models more sparse than L1 regularization. Statement 2| Residual connections can be found in ResNets and Transformers.","True, True","False, False","True, False","False, True",D
"Suppose we like to calculate P(H|E, F) and we have no conditional independence information. Which of the following sets of numbers are sufficient for the calculation?","P(E, F), P(H), P(E|H), P(F|H)","P(E, F), P(H), P(E, F|H)","P(H), P(E|H), P(F|H)","P(E, F), P(E|H), P(F|H)",B
"Which among the following prevents overfitting when we perform bagging?","The use of sampling with replacement as the sampling technique","The use of weak classifiers","The use of classification algorithms which are not prone to overfitting","The practice of validation performed on every classifier trained",B
"Statement 1| PCA and Spectral Clustering (such as Andrew Ng's) perform eigendecomposition on two different matrices. However, the sizes of these two matrices are the same. Statement 2| Since classification is a special case of regression, logistic regression is a special case of linear regression.","True, True","False, False","True, False","False, True",B
"Statement 1| The Stanford Sentiment Treebank contained movie reviews, not book reviews. Statement 2| The Penn Treebank has been used for language modeling.","True, True","False, False","True, False","False, True",A
"What is the dimensionality of the null space of the following matrix? A = [[3, 2, 9], [6, 4, 18], [12, 8, 36]]","0","1","2","3",C
"What are support vectors?","The examples farthest from the decision boundary.","The only examples necessary to compute f(x) in an SVM.","The data centroid.","All the examples that have a non-zero weight αk in a SVM.",B
"Statement 1| Word2Vec parameters were not initialized using a Restricted Boltzmann Machine. Statement 2| The tanh function is a nonlinear activation function.","True, True","False, False","True, False","False, True",A
"If your training loss increases with number of epochs, which of the following could be a possible issue with the learning process?","Regularization is too low and model is overfitting","Regularization is too high and model is underfitting","Step size is too large","Step size is too small",C
"Say the incidence of a disease D is about 5 cases per 100 people (i.e., P(D) = 0.05). Let Boolean random variable D mean a patient “has disease D” and let Boolean random variable TP stand for ""tests positive."" Tests for disease D are known to be very accurate in the sense that the probability of testing positive when you have the disease is 0.99, and the probability of testing negative when you do not have the disease is 0.97. What is P(D | TP), the posterior probability that you have disease D when the test is positive?","0.0495","0.078","0.635","0.97",C
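The posterior in the question above follows from Bayes' rule. This is a small sketch under the question's stated numbers; the function name `posterior_given_positive` is hypothetical and not part of the dataset.

```python
def posterior_given_positive(prior, sensitivity, specificity):
    """Return P(D | TP) via Bayes' rule.

    prior       -- P(D)
    sensitivity -- P(TP | D)
    specificity -- P(not TP | not D)
    """
    # Evidence term P(TP), by the law of total probability.
    evidence = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / evidence


p = posterior_given_positive(prior=0.05, sensitivity=0.99, specificity=0.97)
print(round(p, 3))
```

The numerator is 0.0495 and the evidence is 0.078, giving roughly 0.635 (answer C). Even with an accurate test, the low prior keeps the posterior well below the test's sensitivity.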
"Statement 1| Traditional machine learning results assume that the train and test sets are independent and identically distributed. Statement 2| In 2017, COCO models were usually pretrained on ImageNet.","True, True","False, False","True, False","False, True",A
"Statement 1| The values of the margins obtained by two different kernels K1(x, x0) and K2(x, x0) on the same training set do not tell us which classifier will perform better on the test set. Statement 2| The activation function of BERT is the GELU.","True, True","False, False","True, False","False, True",A
"Which of the following is a clustering algorithm in machine learning?","Expectation Maximization","CART","Gaussian Naïve Bayes","Apriori",A
"You've just finished training a decision tree for spam classification, and it is getting abnormally bad performance on both your training and test sets. You know that your implementation has no bugs, so what could be causing the problem?","Your decision trees are too shallow.","You need to increase the learning rate.","You are overfitting.","None of the above.",A
"K-fold cross-validation is","linear in K","quadratic in K","cubic in K","exponential in K",A
"Statement 1| Industrial-scale neural networks are normally trained on CPUs, not GPUs. Statement 2| The ResNet-50 model has over 1 billion parameters.","True, True","False, False","True, False","False, True",B
"Given two Boolean random variables, A and B, where P(A) = 1/2, P(B) = 1/3, and P(A | ¬B) = 1/4, what is P(A | B)?","1/6","1/4","3/4","1",D
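The value of P(A | B) above can be recovered by rearranging the law of total probability, P(A) = P(A | B) P(B) + P(A | ¬B) P(¬B). The sketch below uses exact fractions; the variable names are assumptions introduced for illustration.

```python
from fractions import Fraction

# Quantities stated in the question.
p_a = Fraction(1, 2)              # P(A)
p_b = Fraction(1, 3)              # P(B)
p_a_given_not_b = Fraction(1, 4)  # P(A | not B)

# Solve P(A) = P(A | B) P(B) + P(A | not B) (1 - P(B)) for P(A | B).
p_a_given_b = (p_a - p_a_given_not_b * (1 - p_b)) / p_b
print(p_a_given_b)
```

The ¬B branch contributes 1/4 · 2/3 = 1/6, leaving 1/3 to be explained by the B branch, so P(A | B) = (1/3) / (1/3) = 1 (answer D).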
"Existential risks posed by AI are most commonly associated with which of the following professors?","Nando de Freitas","Yann LeCun","Stuart Russell","Jitendra Malik",C
"Statement 1| Maximizing the likelihood of logistic regression model yields multiple local optimums. Statement 2| No classifier can do better than a naive Bayes classifier if the distribution of the data is known.","True, True","False, False","True, False","False, True",B
"For Kernel Regression, which one of these structural assumptions is the one that most affects the trade-off between underfitting and overfitting:","Whether kernel function is Gaussian versus triangular versus box-shaped","Whether we use Euclidean versus L1 versus L∞ metrics","The kernel width","The maximum height of the kernel function",C
"Statement 1| The SVM learning algorithm is guaranteed to find the globally optimal hypothesis with respect to its objective function. Statement 2| After being mapped into feature space Q through a radial basis kernel function, a Perceptron may be able to achieve better classification performance than in its original space (though we can't guarantee this).","True, True","False, False","True, False","False, True",A
"For a Gaussian Bayes classifier, which one of these structural assumptions is the one that most affects the trade-off between underfitting and overfitting:","Whether we learn the class centers by Maximum Likelihood or Gradient Descent","Whether we assume full class covariance matrices or diagonal class covariance matrices","Whether we have equal class priors or priors estimated from the data.","Whether we allow classes to have different mean vectors or we force them to share the same mean vector",B
"Statement 1| Overfitting is more likely when the set of training data is small. Statement 2| Overfitting is more likely when the hypothesis space is small.","True, True","False, False","True, False","False, True",D
"Statement 1| Besides EM, gradient descent can be used to perform inference or learning on a Gaussian mixture model. Statement 2| Assuming a fixed number of attributes, a Gaussian-based Bayes optimal classifier can be learned in time linear in the number of records in the dataset.","True, True","False, False","True, False","False, True",A
"Statement 1| In a Bayesian network, the inference results of the junction tree algorithm are the same as the inference results of variable elimination. Statement 2| If two random variables X and Y are conditionally independent given another random variable Z, then in the corresponding Bayesian network, the nodes for X and Y are d-separated given Z.","True, True","False, False","True, False","False, True",C
"Given a large dataset of medical records from patients suffering from heart disease, try to learn whether there might be different clusters of such patients for which we might tailor separate treatments. What kind of learning problem is this?","Supervised learning","Unsupervised learning","Both (a) and (b)","Neither (a) nor (b)",B
"What would you do in PCA to get the same projection as SVD?","Transform data to zero mean","Transform data to zero median","Not possible","None of these",A
"Statement 1| The training error of 1-nearest neighbor classifier is 0. Statement 2| As the number of data points grows to infinity, the MAP estimate approaches the MLE estimate for all possible priors. In other words, given enough data, the choice of prior is irrelevant.","True, True","False, False","True, False","False, True",C
"When doing least-squares regression with regularisation (assuming that the optimisation can be done exactly), increasing the value of the regularisation parameter λ","will never decrease the training error.","will never increase the training error.","will never decrease the testing error.","will never increase the testing error.",A
"Which of the following best describes what discriminative approaches try to model? (w are the parameters in the model)","p(y|x, w)","p(y, x)","p(w|x, w)","None of the above",A
"Statement 1| CIFAR-10 classification performance for convolutional neural networks can exceed 95%. Statement 2| Ensembles of neural networks do not improve classification accuracy since the representations they learn are highly correlated.","True, True","False, False","True, False","False, True",C
"Which of the following points would Bayesians and frequentists disagree on?","The use of a non-Gaussian noise model in probabilistic regression.","The use of probabilistic modelling for regression.","The use of prior distributions on the parameters in a probabilistic model.","The use of class priors in Gaussian Discriminant Analysis.",C
"Statement 1| The BLEU metric uses precision, while the ROUGE metric uses recall. Statement 2| Hidden Markov models were frequently used to model English sentences.","True, True","False, False","True, False","False, True",A
"Statement 1| ImageNet has images of various resolutions. Statement 2| Caltech-101 has more images than ImageNet.","True, True","False, False","True, False","False, True",C
"Which of the following is more appropriate to do feature selection?","Ridge","Lasso","both (a) and (b)","neither (a) nor (b)",B
"Suppose you are given an EM algorithm that finds maximum likelihood estimates for a model with latent variables. You are asked to modify the algorithm so that it finds MAP estimates instead. Which step or steps do you need to modify?","Expectation","Maximization","No modification necessary","Both",B
"For a Gaussian Bayes classifier, which one of these structural assumptions is the one that most affects the trade-off between underfitting and overfitting:","Whether we learn the class centers by Maximum Likelihood or Gradient Descent","Whether we assume full class covariance matrices or diagonal class covariance matrices","Whether we have equal class priors or priors estimated from the data","Whether we allow classes to have different mean vectors or we force them to share the same mean vector",B
"Statement 1| For any two variables x and y having joint distribution p(x, y), we always have H[x, y] ≥ H[x] + H[y] where H is entropy function. Statement 2| For some directed graphs, moralization decreases the number of edges present in the graph.","True, True","False, False","True, False","False, True",B
"Which of the following is NOT supervised learning?","PCA","Decision Tree","Linear Regression","Naive Bayesian",A
"Statement 1| A neural network's convergence depends on the learning rate. Statement 2| Dropout multiplies randomly chosen activation values by zero.","True, True","False, False","True, False","False, True",A
"Which one of the following is equal to P(A, B, C) given Boolean random variables A, B and C, and no independence or conditional independence assumptions between any of them?","P(A | B) * P(B | C) * P(C | A)","P(C | A, B) * P(A) * P(B)","P(A, B | C) * P(C)","P(A | B, C) * P(B | A, C) * P(C | A, B)",C
"Which of the following tasks can be best solved using Clustering.","Predicting the amount of rainfall based on various cues","Detecting fraudulent credit card transactions","Training a robot to solve a maze","All of the above",B
"After applying a regularization penalty in linear regression, you find that some of the coefficients of w are zeroed out. Which of the following penalties might have been used?","L0 norm","L1 norm","L2 norm","either (a) or (b)",D
"A and B are two events. If P(A, B) decreases while P(A) increases, which of the following is true?","P(A|B) decreases","P(B|A) decreases","P(B) decreases","All of above",B
"Statement 1| When learning an HMM for a fixed set of observations, assume we do not know the true number of hidden states (which is often the case), we can always increase the training data likelihood by permitting more hidden states. Statement 2| Collaborative filtering is often a useful model for modeling users' movie preference.","True, True","False, False","True, False","False, True",A
"You are training a linear regression model for a simple estimation task, and notice that the model is overfitting to the data. You decide to add in $\ell_2$ regularization to penalize the weights. As you increase the $\ell_2$ regularization coefficient, what will happen to the bias and variance of the model?","Bias increase ; Variance increase","Bias increase ; Variance decrease","Bias decrease ; Variance increase","Bias decrease ; Variance decrease",B
"Which PyTorch 1.8 command(s) produce a $10\times 5$ Gaussian matrix with each entry i.i.d. sampled from $\mathcal{N}(\mu=5,\sigma^2=16)$ and a $10\times 10$ uniform matrix with each entry i.i.d. sampled from $U[-1,1)$?","\texttt{5 + torch.randn(10,5) * 16} ; \texttt{torch.rand(10,10,low=-1,high=1)}","\texttt{5 + torch.randn(10,5) * 16} ; \texttt{(torch.rand(10,10) - 0.5) / 0.5}","\texttt{5 + torch.randn(10,5) * 4} ; \texttt{2 * torch.rand(10,10) - 1}","\texttt{torch.normal(torch.ones(10,5)*5,torch.ones(5,5)*16)} ; \texttt{2 * torch.rand(10,10) - 1}",C
"Statement 1| The ReLU's gradient is zero for $x<0$, and the sigmoid gradient $\sigma(x)(1-\sigma(x))\le \frac{1}{4}$ for all $x$. Statement 2| The sigmoid has a continuous gradient and the ReLU has a discontinuous gradient.","True, True","False, False","True, False","False, True",A
"Which is true about Batch Normalization?","After applying batch normalization, the layer's activations will follow a standard Gaussian distribution.","The bias parameter of affine layers becomes redundant if a batch normalization layer follows immediately afterward.","The standard weight initialization must be changed when using Batch Normalization.","Batch Normalization is equivalent to Layer Normalization for convolutional neural networks.",B
"Suppose we have the following objective function: $\argmin_{w} \frac{1}{2} \norm{Xw-y}^2_2 + \frac{1}{2}\lambda \norm{w}^2_2$ What is the gradient of $\frac{1}{2} \norm{Xw-y}^2_2 + \frac{1}{2}\lambda \norm{w}^2_2$ with respect to $w$?","$\nabla_w f(w) = (X^\top X + \lambda I)w - X^\top y + \lambda w$","$\nabla_w f(w) = X^\top X w - X^\top y + \lambda$","$\nabla_w f(w) = X^\top X w - X^\top y + \lambda w$","$\nabla_w f(w) = X^\top X w - X^\top y + (\lambda+1) w$",C
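For reference, the gradient asked for above can be derived in two steps. This is a sketch using standard matrix calculus, writing the norms out explicitly and taking $\lambda$ as the regularization coefficient:

```latex
\begin{aligned}
f(w) &= \tfrac{1}{2}(Xw - y)^\top (Xw - y) + \tfrac{1}{2}\lambda\, w^\top w \\
\nabla_w f(w) &= X^\top (Xw - y) + \lambda w \\
              &= X^\top X w - X^\top y + \lambda w
\end{aligned}
```

The last line matches option C; the first option double-counts the $\lambda w$ term.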
"Which of the following is true of a convolution kernel?","Convolving an image with $\begin{bmatrix}1 & 0 & 0\\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$ would not change the image","Convolving an image with $\begin{bmatrix}0 & 0 & 0\\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}$ would not change the image","Convolving an image with $\begin{bmatrix}1 & 1 & 1\\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}$ would not change the image","Convolving an image with $\begin{bmatrix}0 & 0 & 0\\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}$ would not change the image",B
"Which of the following is false?","Semantic segmentation models predict the class of each pixel, while multiclass image classifiers predict the class of the entire image.","A bounding box with an IoU (intersection over union) equal to $96\%$ would likely be considered a true positive.","When a predicted bounding box does not correspond to any object in the scene, it is considered a false positive.","A bounding box with an IoU (intersection over union) equal to $3\%$ would likely be considered a false negative.",D
"Which of the following is false?","The following fully connected network without activation functions is linear: $g_3(g_2(g_1(x)))$, where $g_i(x) = W_i x$ and $W_i$ are matrices.","Leaky ReLU $\max\{0.01x,x\}$ is convex.","A combination of ReLUs such as $ReLU(x) - ReLU(x-1)$ is convex.","The loss $\log \sigma(x)= -\log(1+e^{-x})$ is concave",C
"We are training a fully connected network with two hidden layers to predict housing prices. Inputs are $100$-dimensional, and have several features such as the number of square feet, the median family income, etc. The first hidden layer has $1000$ activations. The second hidden layer has $10$ activations. The output is a scalar representing the house price. Assuming a vanilla network with affine transformations and with no batch normalization and no learnable parameters in the activation function, how many parameters does this network have?","111021","110010","111110","110011",A
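The parameter count above is a sum over affine layers, each contributing a weight matrix plus a bias vector. A minimal sketch, assuming the layer widths given in the question; the helper name `count_parameters` is hypothetical.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of each layer from input to output.
    Each affine layer has n_in * n_out weights plus n_out biases.
    """
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Input 100 -> hidden 1000 -> hidden 10 -> scalar output.
total = count_parameters([100, 1000, 10, 1])
print(total)
```

The three layers contribute 101000, 10010, and 11 parameters respectively, for a total of 111021 (answer A).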
"Statement 1| The derivative of the sigmoid $\sigma(x)=(1+e^{-x})^{-1}$ with respect to $x$ is equal to $\text{Var}(B)$ where $B\sim \text{Bern}(\sigma(x))$ is a Bernoulli random variable. Statement 2| Setting the bias parameters in each layer of a neural network to 0 changes the bias-variance trade-off such that the model's variance increases and the model's bias decreases","True, True","False, False","True, False","False, True",C