dataset-opencompass/machine_learning_test.csv at 447d314a29dc574711c89dc727e5254e979e1f37

4pdadmin 447d314a29 init

2025-07-18 07:25:44 +00:00

31 KiB


1	Statement 1\| Linear regression estimator has the smallest variance among all unbiased estimators. Statement 2\| The coefficients α assigned to the classifiers assembled by AdaBoost are always non-negative.	True, True	False, False	True, False	False, True	D
2	Statement 1\| RoBERTa pretrains on a corpus that is approximately 10x larger than the corpus BERT pretrained on. Statement 2\| ResNeXts in 2018 usually used tanh activation functions.	True, True	False, False	True, False	False, True	C
3	Statement 1\| Support vector machines, like logistic regression models, give a probability distribution over the possible labels given an input example. Statement 2\| We would expect the support vectors to remain the same in general as we move from a linear kernel to higher order polynomial kernels.	True, True	False, False	True, False	False, True	B
4	A machine learning problem involves four attributes plus a class. The attributes have 3, 2, 2, and 2 possible values each. The class has 3 possible values. How many maximum possible different examples are there?	12	24	48	72	D
5	As of 2020, which architecture is best for classifying high-resolution images?	convolutional networks	graph networks	fully connected networks	RBF networks	A
6	Statement 1\| The log-likelihood of the data will always increase through successive iterations of the expectation maximization algorithm. Statement 2\| One disadvantage of Q-learning is that it can only be used when the learner has prior knowledge of how its actions affect its environment.	True, True	False, False	True, False	False, True	B
7	Let us say that we have computed the gradient of our cost function and stored it in a vector g. What is the cost of one gradient descent update given the gradient?	O(D)	O(N)	O(ND)	O(ND^2)	A
8	Statement 1\| For a continuous random variable x and its probability distribution function p(x), it holds that 0 ≤ p(x) ≤ 1 for all x. Statement 2\| Decision tree is learned by minimizing information gain.	True, True	False, False	True, False	False, True	B
9	Consider the Bayesian network given below. How many independent parameters are needed for this Bayesian Network H -> U <- P <- W?	2	4	8	16	C
10	As the number of training examples goes to infinity, your model trained on that data will have:	Lower variance	Higher variance	Same variance	None of the above	A
11	Statement 1\| The set of all rectangles in the 2D plane (which includes non axis-aligned rectangles) can shatter a set of 5 points. Statement 2\| The VC-dimension of k-Nearest Neighbour classifier when k = 1 is infinite.	True, True	False, False	True, False	False, True	A
12	_ refers to a model that can neither model the training data nor generalize to new data.	good fitting	overfitting	underfitting	all of the above	C
13	Statement 1\| The F1 score can be especially useful for datasets with high class imbalance. Statement 2\| The area under the ROC curve is one of the main metrics used to assess anomaly detectors.	True, True	False, False	True, False	False, True	A
14	Statement 1\| The back-propagation algorithm learns a globally optimal neural network with hidden layers. Statement 2\| The VC dimension of a line should be at most 2, since I can find at least one case of 3 points that cannot be shattered by any line.	True, True	False, False	True, False	False, True	B
15	High entropy means that the partitions in classification are	pure	not pure	useful	useless	B
16	Statement 1\| Layer Normalization is used in the original ResNet paper, not Batch Normalization. Statement 2\| DCGANs use self-attention to stabilize training.	True, True	False, False	True, False	False, True	B
17	In building a linear regression model for a particular data set, you observe the coefficient of one of the features having a relatively high negative value. This suggests that	This feature has a strong effect on the model (should be retained)	This feature does not have a strong effect on the model (should be ignored)	It is not possible to comment on the importance of this feature without additional information	Nothing can be determined.	C
18	For a neural network, which one of these structural assumptions is the one that most affects the trade-off between underfitting (i.e. a high bias model) and overfitting (i.e. a high variance model):	The number of hidden nodes	The learning rate	The initial choice of weights	The use of a constant-term unit input	A
19	For polynomial regression, which one of these structural assumptions is the one that most affects the trade-off between underfitting and overfitting:	The polynomial degree	Whether we learn the weights by matrix inversion or gradient descent	The assumed variance of the Gaussian noise	The use of a constant-term unit input	A
20	Statement 1\| As of 2020, some models attain greater than 98% accuracy on CIFAR-10. Statement 2\| The original ResNets were not optimized with the Adam optimizer.	True, True	False, False	True, False	False, True	A
21	The K-means algorithm:	Requires the dimension of the feature space to be no bigger than the number of samples	Has the smallest value of the objective function when K = 1	Minimizes the within class variance for a given number of clusters	Converges to the global optimum if and only if the initial means are chosen as some of the samples themselves	C
22	Statement 1\| VGGNets have convolutional kernels of smaller width and height than AlexNet's first-layer kernels. Statement 2\| Data-dependent weight initialization procedures were introduced before Batch Normalization.	True, True	False, False	True, False	False, True	A
23	What is the rank of the following matrix? A = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]	0	1	2	3	B
24	Statement 1\| Density estimation (using say, the kernel density estimator) can be used to perform classification. Statement 2\| The correspondence between logistic regression and Gaussian Naive Bayes (with identity class covariances) means that there is a one-to-one correspondence between the parameters of the two classifiers.	True, True	False, False	True, False	False, True	C
25	Suppose we would like to perform clustering on spatial data such as the geometrical locations of houses. We wish to produce clusters of many different sizes and shapes. Which of the following methods is the most appropriate?	Decision Trees	Density-based clustering	Model-based clustering	K-means clustering	B
26	Statement 1\| In AdaBoost weights of the misclassified examples go up by the same multiplicative factor. Statement 2\| In AdaBoost, weighted training error e_t of the tth weak classifier on training data with weights D_t tends to increase as a function of t.	True, True	False, False	True, False	False, True	A
27	MLE estimates are often undesirable because	they are biased	they have high variance	they are not consistent estimators	None of the above	B
28	Computational complexity of Gradient descent is,	linear in D	linear in N	polynomial in D	dependent on the number of iterations	C
29	Averaging the output of multiple decision trees helps _.	Increase bias	Decrease bias	Increase variance	Decrease variance	D
30	The model obtained by applying linear regression on the identified subset of features may differ from the model obtained at the end of the process of identifying the subset during	Best-subset selection	Forward stepwise selection	Forward stage wise selection	All of the above	C
31	Neural networks:	Optimize a convex objective function	Can only be trained with stochastic gradient descent	Can use a mix of different activation functions	None of the above	C
32	Say the incidence of a disease D is about 5 cases per 100 people (i.e., P(D) = 0.05). Let Boolean random variable D mean a patient “has disease D” and let Boolean random variable TP stand for "tests positive." Tests for disease D are known to be very accurate in the sense that the probability of testing positive when you have the disease is 0.99, and the probability of testing negative when you do not have the disease is 0.97. What is P(TP), the prior probability of testing positive?	0.0368	0.473	0.078	None of the above	C
33	Statement 1\| After mapped into feature space Q through a radial basis kernel function, 1-NN using unweighted Euclidean distance may be able to achieve better classification performance than in original space (though we can’t guarantee this). Statement 2\| The VC dimension of a Perceptron is smaller than the VC dimension of a simple linear SVM.	True, True	False, False	True, False	False, True	B
34	The disadvantage of Grid search is	It can not be applied to non-differentiable functions.	It can not be applied to non-continuous functions.	It is hard to implement.	It runs reasonably slow for multiple linear regression.	D
35	Predicting the amount of rainfall in a region based on various cues is a ______ problem.	Supervised learning	Unsupervised learning	Clustering	None of the above	A
36	Which of the following sentences is FALSE regarding regression?	It relates inputs to outputs.	It is used for prediction.	It may be used for interpretation.	It discovers causal relationships	D
37	Which one of the following is the main reason for pruning a Decision Tree?	To save computing time during testing	To save space for storing the Decision Tree	To make the training set error smaller	To avoid overfitting the training set	D
38	Statement 1\| The kernel density estimator is equivalent to performing kernel regression with the value Yi = 1/n at each point Xi in the original data set. Statement 2\| The depth of a learned decision tree can be larger than the number of training examples used to create the tree.	True, True	False, False	True, False	False, True	B
39	Suppose your model is overfitting. Which of the following is NOT a valid way to try and reduce the overfitting?	Increase the amount of training data.	Improve the optimisation algorithm being used for error minimisation.	Decrease the model complexity.	Reduce the noise in the training data.	B
40	Statement 1\| The softmax function is commonly used in multiclass logistic regression. Statement 2\| The temperature of a nonuniform softmax distribution affects its entropy.	True, True	False, False	True, False	False, True	A
41	Which of the following is/are true regarding an SVM?	For two dimensional data points, the separating hyperplane learnt by a linear SVM will be a straight line.	In theory, a Gaussian kernel SVM cannot model any complex separating hyperplane.	For every kernel function used in a SVM, one can obtain an equivalent closed form basis expansion.	Overfitting in an SVM is not a function of number of support vectors.	A
42	Which of the following is the joint probability of H, U, P, and W described by the given Bayesian Network H -> U <- P <- W? [note: as the product of the conditional probabilities]	P(H, U, P, W) = P(H) * P(W) * P(P) * P(U)	P(H, U, P, W) = P(H) * P(W) * P(P \| W) * P(W \| H, P)	P(H, U, P, W) = P(H) * P(W) * P(P \| W) * P(U \| H, P)	None of the above	C
43	Statement 1\| Since the VC dimension for an SVM with a Radial Base Kernel is infinite, such an SVM must be worse than an SVM with polynomial kernel which has a finite VC dimension. Statement 2\| A two layer neural network with linear activation functions is essentially a weighted combination of linear separators, trained on a given dataset; the boosting algorithm built on linear separators also finds a combination of linear separators, therefore these two algorithms will give the same result.	True, True	False, False	True, False	False, True	B
44	Statement 1\| The ID3 algorithm is guaranteed to find the optimal decision tree. Statement 2\| Consider a continuous probability distribution with density f() that is nonzero everywhere. The probability of a value x is equal to f(x).	True, True	False, False	True, False	False, True	B
45	Given a Neural Net with N input nodes, no hidden layers, one output node, with Entropy Loss and Sigmoid Activation Functions, which of the following algorithms (with the proper hyper-parameters and initialization) can be used to find the global optimum?	Stochastic Gradient Descent	Mini-Batch Gradient Descent	Batch Gradient Descent	All of the above	D
46	Adding more basis functions in a linear model, pick the most probable option:	Decreases model bias	Decreases estimation bias	Decreases variance	Doesn’t affect bias and variance	A
47	Consider the Bayesian network given below. How many independent parameters would we need if we made no assumptions about independence or conditional independence H -> U <- P <- W?	3	4	7	15	D
48	Another term for out-of-distribution detection is?	anomaly detection	one-class detection	train-test mismatch robustness	background detection	A
49	Statement 1\| We learn a classifier f by boosting weak learners h. The functional form of f’s decision boundary is the same as h’s, but with different parameters. (e.g., if h was a linear classifier, then f is also a linear classifier). Statement 2\| Cross validation can be used to select the number of iterations in boosting; this procedure may help reduce overfitting.	True, True	False, False	True, False	False, True	D
50	Statement 1\| Highway networks were introduced after ResNets and eschew max pooling in favor of convolutions. Statement 2\| DenseNets usually cost more memory than ResNets.	True, True	False, False	True, False	False, True	D
51	If N is the number of instances in the training dataset, nearest neighbors has a classification run time of	O(1)	O( N )	O(log N )	O( N^2 )	B
52	Statement 1\| The original ResNets and Transformers are feedforward neural networks. Statement 2\| The original Transformers use self-attention, but the original ResNet does not.	True, True	False, False	True, False	False, True	A
53	Statement 1\| RELUs are not monotonic, but sigmoids are monotonic. Statement 2\| Neural networks trained with gradient descent with high probability converge to the global optimum.	True, True	False, False	True, False	False, True	D
54	The numerical output of a sigmoid node in a neural network:	Is unbounded, encompassing all real numbers.	Is unbounded, encompassing all integers.	Is bounded between 0 and 1.	Is bounded between -1 and 1.	C
55	Which of the following can only be used when training data are linearly separable?	Linear hard-margin SVM.	Linear Logistic Regression.	Linear Soft margin SVM.	The centroid method.	A
56	Which of the following are the spatial clustering algorithms?	Partitioning based clustering	K-means clustering	Grid based clustering	All of the above	D
57	Statement 1\| The maximum margin decision boundaries that support vector machines construct have the lowest generalization error among all linear classifiers. Statement 2\| Any decision boundary that we get from a generative model with class-conditional Gaussian distributions could in principle be reproduced with an SVM and a polynomial kernel of degree less than or equal to three.	True, True	False, False	True, False	False, True	D
58	Statement 1\| L2 regularization of linear models tends to make models more sparse than L1 regularization. Statement 2\| Residual connections can be found in ResNets and Transformers.	True, True	False, False	True, False	False, True	D
59	Suppose we like to calculate P(H\|E, F) and we have no conditional independence information. Which of the following sets of numbers are sufficient for the calculation?	P(E, F), P(H), P(E\|H), P(F\|H)	P(E, F), P(H), P(E, F\|H)	P(H), P(E\|H), P(F\|H)	P(E, F), P(E\|H), P(F\|H)	B
60	Which among the following prevents overfitting when we perform bagging?	The use of sampling with replacement as the sampling technique	The use of weak classifiers	The use of classification algorithms which are not prone to overfitting	The practice of validation performed on every classifier trained	B
61	Statement 1\| PCA and Spectral Clustering (such as Andrew Ng’s) perform eigendecomposition on two different matrices. However, the size of these two matrices are the same. Statement 2\| Since classification is a special case of regression, logistic regression is a special case of linear regression.	True, True	False, False	True, False	False, True	B
62	Statement 1\| The Stanford Sentiment Treebank contained movie reviews, not book reviews. Statement 2\| The Penn Treebank has been used for language modeling.	True, True	False, False	True, False	False, True	A
63	What is the dimensionality of the null space of the following matrix? A = [[3, 2, −9], [−6, −4, 18], [12, 8, −36]]	0	1	2	3	C
64	What are support vectors?	The examples farthest from the decision boundary.	The only examples necessary to compute f(x) in an SVM.	The data centroid.	All the examples that have a non-zero weight αk in a SVM.	B
65	Statement 1\| Word2Vec parameters were not initialized using a Restricted Boltzmann Machine. Statement 2\| The tanh function is a nonlinear activation function.	True, True	False, False	True, False	False, True	A
66	If your training loss increases with number of epochs, which of the following could be a possible issue with the learning process?	Regularization is too low and model is overfitting	Regularization is too high and model is underfitting	Step size is too large	Step size is too small	C
67	Say the incidence of a disease D is about 5 cases per 100 people (i.e., P(D) = 0.05). Let Boolean random variable D mean a patient “has disease D” and let Boolean random variable TP stand for "tests positive." Tests for disease D are known to be very accurate in the sense that the probability of testing positive when you have the disease is 0.99, and the probability of testing negative when you do not have the disease is 0.97. What is P(D \| TP), the posterior probability that you have disease D when the test is positive?	0.0495	0.078	0.635	0.97	C
68	Statement 1\| Traditional machine learning results assume that the train and test sets are independent and identically distributed. Statement 2\| In 2017, COCO models were usually pretrained on ImageNet.	True, True	False, False	True, False	False, True	A
69	Statement 1\| The values of the margins obtained by two different kernels K1(x, x0) and K2(x, x0) on the same training set do not tell us which classifier will perform better on the test set. Statement 2\| The activation function of BERT is the GELU.	True, True	False, False	True, False	False, True	A
70	Which of the following is a clustering algorithm in machine learning?	Expectation Maximization	CART	Gaussian Naïve Bayes	Apriori	A
71	You've just finished training a decision tree for spam classification, and it is getting abnormally bad performance on both your training and test sets. You know that your implementation has no bugs, so what could be causing the problem?	Your decision trees are too shallow.	You need to increase the learning rate.	You are overfitting.	None of the above.	A
72	K-fold cross-validation is	linear in K	quadratic in K	cubic in K	exponential in K	A
73	Statement 1\| Industrial-scale neural networks are normally trained on CPUs, not GPUs. Statement 2\| The ResNet-50 model has over 1 billion parameters.	True, True	False, False	True, False	False, True	B
74	Given two Boolean random variables, A and B, where P(A) = 1/2, P(B) = 1/3, and P(A \| ¬B) = 1/4, what is P(A \| B)?	1/6	1/4	3/4	1	D
75	Existential risks posed by AI are most commonly associated with which of the following professors?	Nando de Freitas	Yann LeCun	Stuart Russell	Jitendra Malik	C
76	Statement 1\| Maximizing the likelihood of logistic regression model yields multiple local optimums. Statement 2\| No classifier can do better than a naive Bayes classifier if the distribution of the data is known.	True, True	False, False	True, False	False, True	B
77	For Kernel Regression, which one of these structural assumptions is the one that most affects the trade-off between underfitting and overfitting:	Whether kernel function is Gaussian versus triangular versus box-shaped	Whether we use Euclidian versus L1 versus L∞ metrics	The kernel width	The maximum height of the kernel function	C
78	Statement 1\| The SVM learning algorithm is guaranteed to find the globally optimal hypothesis with respect to its objective function. Statement 2\| After being mapped into feature space Q through a radial basis kernel function, a Perceptron may be able to achieve better classification performance than in its original space (though we can’t guarantee this).	True, True	False, False	True, False	False, True	A
79	For a Gaussian Bayes classifier, which one of these structural assumptions is the one that most affects the trade-off between underfitting and overfitting:	Whether we learn the class centers by Maximum Likelihood or Gradient Descent	Whether we assume full class covariance matrices or diagonal class covariance matrices	Whether we have equal class priors or priors estimated from the data.	Whether we allow classes to have different mean vectors or we force them to share the same mean vector	B
80	Statement 1\| Overfitting is more likely when the set of training data is small. Statement 2\| Overfitting is more likely when the hypothesis space is small.	True, True	False, False	True, False	False, True	D
81	Statement 1\| Besides EM, gradient descent can be used to perform inference or learning on a Gaussian mixture model. Statement 2\| Assuming a fixed number of attributes, a Gaussian-based Bayes optimal classifier can be learned in time linear in the number of records in the dataset.	True, True	False, False	True, False	False, True	A
82	Statement 1\| In a Bayesian network, the inference results of the junction tree algorithm are the same as the inference results of variable elimination. Statement 2\| If two random variables X and Y are conditionally independent given another random variable Z, then in the corresponding Bayesian network, the nodes for X and Y are d-separated given Z.	True, True	False, False	True, False	False, True	C
83	Given a large dataset of medical records from patients suffering from heart disease, try to learn whether there might be different clusters of such patients for which we might tailor separate treatments. What kind of learning problem is this?	Supervised learning	Unsupervised learning	Both (a) and (b)	Neither (a) nor (b)	B
84	What would you do in PCA to get the same projection as SVD?	Transform data to zero mean	Transform data to zero median	Not possible	None of these	A
85	Statement 1\| The training error of 1-nearest neighbor classifier is 0. Statement 2\| As the number of data points grows to infinity, the MAP estimate approaches the MLE estimate for all possible priors. In other words, given enough data, the choice of prior is irrelevant.	True, True	False, False	True, False	False, True	C
86	When doing least-squares regression with regularisation (assuming that the optimisation can be done exactly), increasing the value of the regularisation parameter λ	will never decrease the training error.	will never increase the training error.	will never decrease the testing error.	will never increase the testing error.	A
87	Which of the following best describes what discriminative approaches try to model? (w are the parameters in the model)	p(y\|x, w)	p(y, x)	p(w\|x, w)	None of the above	A
88	Statement 1\| CIFAR-10 classification performance for convolution neural networks can exceed 95%. Statement 2\| Ensembles of neural networks do not improve classification accuracy since the representations they learn are highly correlated.	True, True	False, False	True, False	False, True	C
89	Which of the following points would Bayesians and frequentists disagree on?	The use of a non-Gaussian noise model in probabilistic regression.	The use of probabilistic modelling for regression.	The use of prior distributions on the parameters in a probabilistic model.	The use of class priors in Gaussian Discriminant Analysis.	C
90	Statement 1\| The BLEU metric uses precision, while the ROUGE metric uses recall. Statement 2\| Hidden Markov models were frequently used to model English sentences.	True, True	False, False	True, False	False, True	A
91	Statement 1\| ImageNet has images of various resolutions. Statement 2\| Caltech-101 has more images than ImageNet.	True, True	False, False	True, False	False, True	C
92	Which of the following is more appropriate to do feature selection?	Ridge	Lasso	both (a) and (b)	neither (a) nor (b)	B
93	Suppose you are given an EM algorithm that finds maximum likelihood estimates for a model with latent variables. You are asked to modify the algorithm so that it finds MAP estimates instead. Which step or steps do you need to modify?	Expectation	Maximization	No modification necessary	Both	B
94	For a Gaussian Bayes classifier, which one of these structural assumptions is the one that most affects the trade-off between underfitting and overfitting:	Whether we learn the class centers by Maximum Likelihood or Gradient Descent	Whether we assume full class covariance matrices or diagonal class covariance matrices	Whether we have equal class priors or priors estimated from the data	Whether we allow classes to have different mean vectors or we force them to share the same mean vector	B
95	Statement 1\| For any two variables x and y having joint distribution p(x, y), we always have H[x, y] ≥ H[x] + H[y] where H is entropy function. Statement 2\| For some directed graphs, moralization decreases the number of edges present in the graph.	True, True	False, False	True, False	False, True	B
96	Which of the following is NOT supervised learning?	PCA	Decision Tree	Linear Regression	Naive Bayesian	A
97	Statement 1\| A neural network's convergence depends on the learning rate. Statement 2\| Dropout multiplies randomly chosen activation values by zero.	True, True	False, False	True, False	False, True	A
98	Which one of the following is equal to P(A, B, C) given Boolean random variables A, B and C, and no independence or conditional independence assumptions between any of them?	P(A \| B) * P(B \| C) * P(C \| A)	P(C \| A, B) * P(A) * P(B)	P(A, B \| C) * P(C)	P(A \| B, C) * P(B \| A, C) * P(C \| A, B)	C
99	Which of the following tasks can be best solved using Clustering.	Predicting the amount of rainfall based on various cues	Detecting fraudulent credit card transactions	Training a robot to solve a maze	All of the above	B
100	After applying a regularization penalty in linear regression, you find that some of the coefficients of w are zeroed out. Which of the following penalties might have been used?	L0 norm	L1 norm	L2 norm	either (a) or (b)	D
101	A and B are two events. If P(A, B) decreases while P(A) increases, which of the following is true?	P(A\|B) decreases	P(B\|A) decreases	P(B) decreases	All of above	B
102	Statement 1\| When learning an HMM for a fixed set of observations, assume we do not know the true number of hidden states (which is often the case), we can always increase the training data likelihood by permitting more hidden states. Statement 2\| Collaborative filtering is often a useful model for modeling users' movie preference.	True, True	False, False	True, False	False, True	A
103	You are training a linear regression model for a simple estimation task, and notice that the model is overfitting to the data. You decide to add in $\ell_2$ regularization to penalize the weights. As you increase the $\ell_2$ regularization coefficient, what will happen to the bias and variance of the model?	Bias increase ; Variance increase	Bias increase ; Variance decrease	Bias decrease ; Variance increase	Bias decrease ; Variance decrease	B
104	Which PyTorch 1.8 command(s) produce $10\times 5$ Gaussian matrix with each entry i.i.d. sampled from $\mathcal{N}(\mu=5,\sigma^2=16)$ and a $10\times 10$ uniform matrix with each entry i.i.d. sampled from $U[-1,1)$?	\texttt{5 + torch.randn(10,5) * 16} ; \texttt{torch.rand(10,10,low=-1,high=1)}	\texttt{5 + torch.randn(10,5) * 16} ; \texttt{(torch.rand(10,10) - 0.5) / 0.5}	\texttt{5 + torch.randn(10,5) * 4} ; \texttt{2 * torch.rand(10,10) - 1}	\texttt{torch.normal(torch.ones(10,5)*5,torch.ones(5,5)*16)} ; \texttt{2 * torch.rand(10,10) - 1}	C
105	Statement 1\| The ReLU's gradient is zero for $x<0$, and the sigmoid gradient $\sigma(x)(1-\sigma(x))\le \frac{1}{4}$ for all $x$. Statement 2\| The sigmoid has a continuous gradient and the ReLU has a discontinuous gradient.	True, True	False, False	True, False	False, True	A
106	Which is true about Batch Normalization?	After applying batch normalization, the layer’s activations will follow a standard Gaussian distribution.	The bias parameter of affine layers becomes redundant if a batch normalization layer follows immediately afterward.	The standard weight initialization must be changed when using Batch Normalization.	Batch Normalization is equivalent to Layer Normalization for convolutional neural networks.	B
107	Suppose we have the following objective function: $\argmin_{w} \frac{1}{2} \norm{Xw-y}^2_2 + \frac{1}{2}\lambda \norm{w}^2_2$ What is the gradient of $\frac{1}{2} \norm{Xw-y}^2_2 + \frac{1}{2}\lambda \norm{w}^2_2$ with respect to $w$?	$\nabla_w f(w) = (X^\top X + \lambda I)w - X^\top y + \lambda w$	$\nabla_w f(w) = X^\top X w - X^\top y + \lambda$	$\nabla_w f(w) = X^\top X w - X^\top y + \lambda w$	$\nabla_w f(w) = X^\top X w - X^\top y + (\lambda+1) w$	C
108	Which of the following is true of a convolution kernel?	Convolving an image with $\begin{bmatrix}1 & 0 & 0\\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$ would not change the image	Convolving an image with $\begin{bmatrix}0 & 0 & 0\\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}$ would not change the image	Convolving an image with $\begin{bmatrix}1 & 1 & 1\\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}$ would not change the image	Convolving an image with $\begin{bmatrix}0 & 0 & 0\\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}$ would not change the image	B
109	Which of the following is false?	Semantic segmentation models predict the class of each pixel, while multiclass image classifiers predict the class of entire image.	A bounding box with an IoU (intersection over union) equal to $96\%$ would likely be considered a true positive.	When a predicted bounding box does not correspond to any object in the scene, it is considered a false positive.	A bounding box with an IoU (intersection over union) equal to $3\%$ would likely be considered a false negative.	D
110	Which of the following is false?	The following fully connected network without activation functions is linear: $g_3(g_2(g_1(x)))$, where $g_i(x) = W_i x$ and $W_i$ are matrices.	Leaky ReLU $\max\{0.01x,x\}$ is convex.	A combination of ReLUs such as $ReLU(x) - ReLU(x-1)$ is convex.	The loss $\log \sigma(x)= -\log(1+e^{-x})$ is concave	C
111	We are training a fully connected network with two hidden layers to predict housing prices. Inputs are $100$-dimensional, and have several features such as the number of square feet, the median family income, etc. The first hidden layer has $1000$ activations. The second hidden layer has $10$ activations. The output is a scalar representing the house price. Assuming a vanilla network with affine transformations and with no batch normalization and no learnable parameters in the activation function, how many parameters does this network have?	111021	110010	111110	110011	A
112	Statement 1\| The derivative of the sigmoid $\sigma(x)=(1+e^{-x})^{-1}$ with respect to $x$ is equal to $\text{Var}(B)$ where $B\sim \text{Bern}(\sigma(x))$ is a Bernoulli random variable. Statement 2\| Setting the bias parameters in each layer of a neural network to 0 changes the bias-variance trade-off such that the model's variance increases and the model's bias decreases	True, True	False, False	True, False	False, True	C
