Machine Learning and Deep Learning Concepts Explained
XGBoost Algorithm
Which of the following is/are true about the XGBoost algorithm?
- It consists of decision trees as base learners.
- It is not the best choice for complex image recognition or NLP projects.
- Boosting is a technique that converts a collection of weak learners into a strong learner.
- If we use a CART tree as the base model, each leaf node is assigned a numerical score.
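As a minimal, hedged sketch of these points (assuming the xgboost and scikit-learn packages are installed; the dataset and hyperparameter values are purely illustrative), a boosted ensemble of decision-tree base learners can be trained like this:

import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each boosting round adds one small CART tree (a weak learner) to the ensemble.
model = xgb.XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))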
Which of the following is/are true about the XGBoost model?
- It is an additive model where the base functions are chosen from the set of possible CART trees.
- The base tree maps an example to a leaf node.
- Each leaf node of the tree is given a weight.
- The weight on each leaf represents an example’s score from that tree.
Which of the following is/are true about the XGBoost model?
- There is a regularized objective function that we seek to minimize.
- The regularized objective function consists of two parts – a loss function and a second term that penalizes model complexity.
- The model complexity of the tree depends on the number of leaf nodes in the tree.
- The model complexity also depends on the L2 norm of the weights of each tree.
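Concretely, the regularized objective described above can be written as Obj = Σ_i l(y_i, ŷ_i) + Σ_k Ω(f_k), with Ω(f) = γT + (1/2) λ Σ_j w_j², where T is the number of leaf nodes of tree f and the w_j are its leaf weights; γ and λ control how strongly the number of leaves and the L2 norm of the leaf weights are penalized.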
Neural Networks
What does each neuron of a neural network compute?
- It computes a non-linear transformation of the input and then applies an activation function to the transformation.
- It computes a linear transformation of the input and then applies an activation to the transformation.
- It applies a derivative and then an integral.
- None of the above.
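A neuron computes a linear transformation of its inputs and then applies an activation function to the result. A minimal NumPy sketch (the weights, bias, and the choice of sigmoid as the activation are illustrative assumptions):

import numpy as np

def neuron(x, w, b):
    z = np.dot(w, x) + b              # linear transformation of the input
    return 1.0 / (1.0 + np.exp(-z))   # activation (sigmoid, as an example)

print(neuron(np.array([1.0, 2.0]), np.array([0.5, -0.25]), 0.1))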
What is the equation of the logistic loss function? You can assume the symbols have their usual meanings.
- −(y_i log(p(y_i)) + (1 − y_i) log(1 − p(y_i)))
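A small sketch of that loss, averaged over a batch (the labels and predicted probabilities below are made up):

import numpy as np

def logistic_loss(y_true, p_pred):
    # -(y*log(p) + (1 - y)*log(1 - p)), averaged over the examples
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

print(logistic_loss(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))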
Convolutional Neural Networks (CNN)
Which of the following would be the advantages of using a CNN vs a fully connected neural network?
- A CNN learns its filters automatically, without the manual effort required for hand-engineered features.
- A CNN captures spatial features from an image; spatial features are local patterns that can be detected in any part of the image.
- In a CNN, the hidden layers perform a convolution operation rather than being fully connected to the previous layer's neurons.
- Due to weight sharing and local connections, the number of parameters needed in a CNN is much lower than in the fully connected layers of a deep neural network.
You come up with a CNN classifier. For each layer, calculate the number of weights and the size of the associated feature maps. The notation follows this convention: CONV-K-N denotes a convolutional layer with N filters, each of size K×K; padding and stride are always 0 and 1, respectively. POOL-K indicates a K×K pooling layer with stride K and padding 0. FC-N stands for a fully connected layer with N neurons.
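A hedged helper for that bookkeeping (it assumes square inputs with C_in channels, the padding/stride conventions stated above, and that one bias per filter or neuron is counted among the weights):

def conv_layer(h, c_in, k, n):
    # CONV-K-N, padding 0, stride 1: weights = K*K*C_in*N (+ N biases)
    return k * k * c_in * n + n, (h - k + 1, h - k + 1, n)

def pool_layer(h, c_in, k):
    # POOL-K, stride K, padding 0: no learnable weights
    return 0, (h // k, h // k, c_in)

def fc_layer(n_in, n_out):
    # FC-N: weights = n_in*N (+ N biases)
    return n_in * n_out + n_out, (n_out,)

# Example: a 32x32x3 input through CONV-3-8, then POOL-2
w1, fm1 = conv_layer(32, 3, 3, 8)     # 224 weights, feature map 30x30x8
w2, fm2 = pool_layer(fm1[0], 8, 2)    # 0 weights, feature map 15x15x8
print(w1, fm1, w2, fm2)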
Text Processing and Embeddings
Consider the following sentences that form a text corpus:
Sentence 1: UTD is a wonderful school in Dallas
Sentence 2: Dallas is a wonderful place
If you encode these sentences using one-hot encoding, what will be the length of each vector?
Distinct words = [UTD, is, a, wonderful, school, in, Dallas, place]
Number of distinct words = 8
Number of vectors = 2 (one per sentence)
Length of each vector = 8.
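A quick sketch that reproduces these counts (the tokenization, a plain lowercase split, is an assumption):

corpus = ["UTD is a wonderful school in Dallas", "Dallas is a wonderful place"]
vocab = sorted({word.lower() for sentence in corpus for word in sentence.split()})
print(len(vocab))   # 8 distinct words, so each vector has length 8

# One (multi-)hot vector per sentence: 2 vectors of length 8
vectors = [[1 if word in sentence.lower().split() else 0 for word in vocab]
           for sentence in corpus]
print(vectors)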
Which of the following is/are true about the embedding layer for text processing?
- It converts sparse vectors into dense vectors.
- The number of vectors remains the same, but their dimensions (sizes) are reduced.
- Embedding layer in TensorFlow can be used to learn dense representation.
- Embedding layer’s output dimension must be the same as input dimension.
Consider the following code snippet in TensorFlow:
tf.keras.layers.Embedding(1000, 64, input_length=10)
How many parameters would be needed for this layer?
- 64,000 (the embedding matrix has vocab_size × output_dim = 1000 × 64 entries, and the layer has no bias term)
Consider the following model created using TensorFlow:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense

embedding_dim = 16
vocab_size = 1000

model = Sequential([
    Embedding(vocab_size, embedding_dim, name="embedding"),
    Dense(16, activation='relu'),
    Dense(1)
])
How many parameters would be needed for this model?
- Embedding layer = 1000 x 16 = 16,000
- First Dense layer = 16 x 16 + 16 = 272
- Second Dense Layer = 16 x 1 + 1 = 17
- Total = 16,289
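As a sanity check (a minimal sketch; the sequence length of 10 below is an assumed value, since the model does not fix one), the total can be verified programmatically:

model.build(input_shape=(None, 10))
model.summary()              # should report 16,289 trainable parameters
print(model.count_params())  # 16289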
Which of the following is/are true about pre-trained word embeddings?
- They are an example of transfer learning, wherein knowledge from one task is used in another task.
- They are trained in such a way that semantically similar words have higher geometric similarity.
- Google Word2Vec and Stanford’s GloVe are two examples of pre-trained word embeddings.
- The weights of the pre-trained embeddings must be re-calculated for every scenario.
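A hedged sketch with gensim (assuming the gensim package is installed and that the "glove-wiki-gigaword-50" vectors are available through its downloader):

import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")    # pre-trained GloVe word vectors
# Semantically similar words should have high geometric (cosine) similarity.
print(glove.most_similar("school", topn=3))
print(glove.similarity("dallas", "houston"))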
Long Short-Term Memory (LSTM)
Which of the following is/are valid inputs to a single directional LSTM cell?
- Cell state that carries relevant information from all previous cells.
- A connection from LSTM cell to itself.
- External input.
- Hidden state from previous cell.
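A minimal TensorFlow sketch that makes those inputs explicit (the input and state sizes are illustrative): a single LSTM cell consumes the external input for the current time step together with the hidden state and cell state produced by the previous step.

import tensorflow as tf

cell = tf.keras.layers.LSTMCell(units=8)
x_t = tf.random.normal((1, 4))    # external input at time step t
h_prev = tf.zeros((1, 8))         # hidden state from the previous cell
c_prev = tf.zeros((1, 8))         # cell state carrying earlier information
output, (h_new, c_new) = cell(x_t, states=[h_prev, c_prev])
print(output.shape, h_new.shape, c_new.shape)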
K-Means Clustering
We would like to perform K-means clustering on the four points shown below:
x1 = (1, 1)
x2 = (2, 2)
x3 = (6, 6)
x4 = (7, 7)
The initial centers are x1 and x2 for clusters C1 and C2 respectively.
- We will use Manhattan distance. For the first iteration, which points will be assigned to C1 and C2? x1 is assigned to C1; x2, x3, and x4 are assigned to C2. New centers after the first iteration: C1 = (1, 1), C2 = (5, 5).
- For the second iteration, which points will be assigned to C1 and C2? x1 and x2 are assigned to C1; x3 and x4 are assigned to C2. New centers after the second iteration: C1 = (1.5, 1.5), C2 = (6.5, 6.5).
- For the third iteration, which points will be assigned to C1 and C2? x1 and x2 stay in C1; x3 and x4 stay in C2. New centers after the third iteration: C1 = (1.5, 1.5), C2 = (6.5, 6.5). Since nothing changed, we can stop.
- What will be the value of the Within-Cluster Sum of Squares (WCSS) at the end of clustering? Each point lies at Manhattan distance 1 from its cluster center, so Cluster 1 contributes 1² + 1² = 2 and Cluster 2 contributes 1² + 1² = 2, giving a total of 4.0.
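A minimal NumPy sketch of the iterations above (one possible implementation; the assignment step uses Manhattan distance, and the last line squares each point's distance to its center to reproduce the WCSS of 4.0):

import numpy as np

points = np.array([[1, 1], [2, 2], [6, 6], [7, 7]], dtype=float)
centers = np.array([[1, 1], [2, 2]], dtype=float)   # initial centers x1 and x2

for _ in range(10):
    # Assignment step: nearest center by Manhattan (L1) distance
    dists = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each new center is the mean of its assigned points
    new_centers = np.array([points[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centers, centers):
        break                       # assignments stable, stop
    centers = new_centers

wcss = sum(np.abs(p - centers[l]).sum() ** 2 for p, l in zip(points, labels))
print(centers, wcss)                # centers (1.5, 1.5) and (6.5, 6.5), WCSS 4.0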
Recommender Systems
- Types of user-user similarity to be familiar with:
- Jaccard: number of common items divided by the number of items in the union of the two sets.
- Cosine: the dot product of the two rating vectors (with missing values treated as 0) divided by the product of their norms.
- Pearson’s correlation coefficient: measures how the two rating vectors vary together; it is equivalent to cosine similarity computed after mean-centering each user's ratings.
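A small sketch of the three measures on two users' ratings (the rating vectors are made up; unrated items are written as 0, and Pearson is computed with SciPy):

import numpy as np
from scipy.stats import pearsonr

u = np.array([5, 3, 0, 4, 0], dtype=float)   # hypothetical ratings of user 1
v = np.array([4, 0, 2, 5, 1], dtype=float)   # hypothetical ratings of user 2

# Jaccard: items rated by both / items rated by either
rated_u, rated_v = set(np.nonzero(u)[0]), set(np.nonzero(v)[0])
jaccard = len(rated_u & rated_v) / len(rated_u | rated_v)

# Cosine: dot product (missing values as 0) over the product of the norms
cosine = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Pearson: how the two rating vectors vary together
pearson, _ = pearsonr(u, v)
print(jaccard, cosine, pearson)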
In the surpriselib library, if you want to use the KNNBasic algorithm with item-based cosine similarity, which code snippet would you use:
from surprise import KNNBasic
sim_options = {'name': 'cosine', 'user_based': False}
algo = KNNBasic(sim_options=sim_options)
algo.fit(trainset)
Which of the following can be used to evaluate predictions of recommender systems in the surpriselib library?
- RMSE
- MAE
- MSE
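A hedged sketch using surpriselib's cross-validation helper (assuming scikit-surprise is installed and the built-in MovieLens 100k dataset can be downloaded):

from surprise import Dataset, SVD
from surprise.model_selection import cross_validate

data = Dataset.load_builtin('ml-100k')
algo = SVD()
# Report RMSE and MAE over 5 cross-validation folds
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)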
There are two friends, Jay and Bill. Jay is very generous in his ratings and never gives a rating below 3 out of 5, while Bill is very stingy and never gives a rating above 3. Jay goes to the restaurant EatAllYouCan and gives it a rating of 4.1. How can we fairly predict Bill's rating for that restaurant?
- By taking user bias into account.
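As a worked example with made-up averages: suppose Jay's mean rating is 4.0 and Bill's mean rating is 2.5. Jay's 4.1 is only 0.1 above his personal norm, so a bias-aware prediction for Bill is roughly 2.5 + 0.1 = 2.6 rather than anything close to 4.1.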
- Idea behind the latent factor approach: users and items are mapped to a joint latent factor space of lower dimensionality, and user-item interactions are modeled as inner products (dot products) in that lower-dimensional space.
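A tiny NumPy sketch of that idea (the factor dimension and all numbers are made up): each user and each item gets a low-dimensional vector, and the predicted rating is the bias terms plus their dot product.

import numpy as np

mu, b_user, b_item = 3.4, -0.3, 0.2         # global mean and bias terms (assumed)
p_u = np.array([0.8, -0.1, 0.3])            # user factors in a 3-d latent space
q_i = np.array([0.5, 0.4, -0.2])            # item factors in the same space
r_hat = mu + b_user + b_item + p_u @ q_i    # predicted rating
print(r_hat)                                # 3.6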
Which of the following are the baseline algorithms available in the surpriselib library of Python? Hint: https://surprise.readthedocs.io/en/stable/prediction_algorithms.html
- SVD
- ALS