Box-Cox transformations consist of raising data to a certain power, such as squaring it, cubing it, or taking its square root (raising it to the 1/2 power). The ‘0th power’ in Box-Cox transformations is defined to be the log transformation: simply raising data to the 0th power would always give 1, but the Box-Cox form (x^λ − 1)/λ approaches log(x) as λ approaches 0.

The logarithm function can especially boost model performance because it puts exponential relationships on a linear scale. This means linear models like linear regression can perform better on such data.

Squaring or cubing a feature can also straighten out data or put emphasis on the parts of the data that are important.
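
The linearizing effect of the log transform can be sketched with a toy example (the data below is synthetic, chosen only for illustration):

```python
import math

# Synthetic exponential data: y = 3 * 2**x.
xs = [0, 1, 2, 3, 4]
ys = [3 * 2**x for x in xs]          # grows exponentially in x

log_ys = [math.log(y) for y in ys]   # log-transformed target

# After the transform, successive differences are constant -> linear in x.
diffs = [round(log_ys[i + 1] - log_ys[i], 10) for i in range(len(log_ys) - 1)]
print(diffs)  # every difference equals log(2)
```

A linear regression fit on `(x, log y)` would recover the growth rate exactly, while a fit on the raw `(x, y)` pairs would not.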

* Decision Trees often have high variance because the algorithm finds niche patterns in the data and creates specific nodes just to address them. If unchecked, a decision tree will create so many nodes that it performs perfectly on the training data but fails on the testing data. One method to fix overfitting in decision trees is called pruning.

* Pruning reduces the size of decision trees by removing sections of the tree that provide little classification power. This helps the decision tree generalize and forces it to only create nodes that reflect the data’s structure and not simply noise.

* The curse of dimensionality refers to issues that arise with high-dimensional data and do not occur in 2D/3D space.

* As dimensionality increases, the volume of the space grows exponentially, so a fixed amount of data becomes increasingly sparse.

* In very high-dimensional space, distances between points become nearly uniform, so distance-based methods like KNN lose much of their discriminating power.

A normal distribution (aka Bell Curve) is a distribution with most instances clustered at the center and the number of instances decreasing as distance from the center increases. Typically ~68% of the data falls within one stdev of the mean, ~95% within two stdevs, and ~99.7% within three stdevs.
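
These coverage figures can be checked against the standard normal distribution; a quick sketch using Python’s `math.erf`:

```python
import math

# For a standard normal Z, P(|Z| < k) = erf(k / sqrt(2)).
coverage = {k: math.erf(k / math.sqrt(2)) for k in (1, 2, 3)}
for k, p in coverage.items():
    print(f"within {k} stdev(s): {p:.4f}")
# within 1: ~0.6827, within 2: ~0.9545, within 3: ~0.9973
```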

* Ensembles are groups of algorithms that vote on the final decision.

* Ensembles succeed because one model’s weaknesses can be outvoted by other models’ strengths, but this means that a successful ensemble must be diverse: each model’s weaknesses must be different. Studies have shown that properly created ensembles almost always perform better than single classifiers.

* Bagging prepares multiple subsets by randomly sampling data from the main dataset (there will be overlap within the subsets). Each model is trained on one of the subsets, and their final decisions are aggregated through some function.

* Boosting iteratively adjusts the weight of an observation based on the last classification. If an observation was classified incorrectly, it increases the weight of that observation, and vice versa. Boosting decreases bias error and builds strong predictive models.

Hard voting is when each model’s final classification (for example, 0 or 1) is aggregated, typically through the mode (majority vote). Soft voting is when each model’s predicted probabilities (for example, 85% sure of classification 1) are aggregated, most likely through the mean. Soft voting may be advantageous in certain cases but could lead to overfitting and a lack of generalization.
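
A minimal sketch of the difference, using three hypothetical binary classifiers with made-up probabilities:

```python
from statistics import mean, mode

# Hypothetical P(class = 1) from three binary classifiers.
probs = [0.55, 0.55, 0.05]

# Hard voting: each model casts a discrete vote; the mode (majority) wins.
hard_votes = [1 if p >= 0.5 else 0 for p in probs]   # [1, 1, 0]
hard_result = mode(hard_votes)                        # majority class -> 1

# Soft voting: average the probabilities, then threshold.
soft_result = 1 if mean(probs) >= 0.5 else 0          # mean ~0.383 -> 0

print(hard_result, soft_result)
```

Note that the two schemes can disagree, as here: soft voting lets the one very confident ‘0’ prediction (0.05) outweigh two barely-confident ‘1’ votes.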

* An ecommerce company decides to give a $1000 gift voucher to the customers they think will purchase at least $5000 worth of items.

* If the company’s model produces a false negative, it will (mistakenly) not send the voucher because it incorrectly believes that customer will not spend $5000.

* Although this is not ideal, the company does not lose any money. If the company sends a voucher to a false positive (someone it incorrectly predicts will spend $5000), the company loses money on someone who will not spend at least $5000.

Recall: ‘out of all the actually true samples, how many did the model classify as true?’.

Precision: ‘out of all the samples our model classified as true, how many were actually true?’

* Mean Squared Error ‘highlights’ larger errors. Since the derivative of x² is 2x, the penalty grows faster as the error grows, so one large error contributes disproportionately more than several small ones.

* Mean Absolute Error may be favored because it produces a more interpretable result: it is in the same units as the target variable.
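
A small sketch of the contrast, with hypothetical residuals: the two error lists below have the same MAE, but the one containing a single large error has a much larger MSE.

```python
# Two sets of residuals with the same total absolute error.
errors_small = [2, 2, 2, 2]     # several modest errors
errors_outlier = [0, 0, 0, 8]   # one large error

def mae(errs):
    # Mean Absolute Error: average of |error|.
    return sum(abs(e) for e in errs) / len(errs)

def mse(errs):
    # Mean Squared Error: average of error**2.
    return sum(e * e for e in errs) / len(errs)

print(mae(errors_small), mae(errors_outlier))   # 2.0 and 2.0
print(mse(errors_small), mse(errors_outlier))   # 4.0 and 16.0
```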

* Let’s suppose you are being tested for a disease — if you have the illness the test will end up saying you have the illness. If you don’t have the illness, 5% of the time the test will end up saying you have the illness (a false positive); 95% of the time the test will say that you do not have the illness.

* Therefore there is a 5% error rate when you do not have the illness.

* Out of 1000 people, 1 person with the disease will get a true positive result. Out of the remaining 999 people, 5% (~50 people) will get a false positive result.

* So out of 1000 people, ~51 will test positive even though only one person actually has the illness.

* There is (only) about a 2% probability (1/51) of having the disease even when the test is positive.
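
The arithmetic above can be written out directly (assuming a prevalence of 1 in 1,000, perfect sensitivity, and a 5% false-positive rate):

```python
# Of 1000 people, 1 has the disease and always tests positive;
# ~5% of the 999 healthy people also test positive.
population = 1000
true_positives = 1
false_positives = 0.05 * (population - 1)   # ~49.95 people

# P(disease | positive) = true positives / all positives.
p_disease_given_positive = true_positives / (true_positives + false_positives)
print(round(p_disease_given_positive, 3))   # ~0.02
```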

* Complete case treatment = removing any row that has a NA value. This is feasible if there are not very many NA values spread across several rows and there is sufficient data; otherwise, complete case treatment can be damaging. In real-world data, removing any rows with NA values could eliminate certain observable patterns in the data.

* When complete case treatment is not possible, there are a variety of methods to fill in missing data, such as mode, median, or mean. Which one to use depends on the context.

* Another method is to find the k-nearest neighbors of an observation with a missing value and fill it with the average, median, or mode of those neighbors. This provides more customization and specificity than a single global summary value.

If the method used to fill in data is done carelessly, it could result in selection bias: a model is only as good as its data, and if the data is skewed, the model will be skewed as well.
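
A minimal sketch of summary-statistic imputation on a hypothetical column (using `None` for NA); the median is chosen here because it is robust to the outlier:

```python
from statistics import median

# Hypothetical column with missing values; 100 is an outlier.
column = [4, 7, None, 5, None, 100]

observed = [v for v in column if v is not None]
fill = median(observed)          # 6.0; unlike the mean, not pulled up by 100
filled = [fill if v is None else v for v in column]
print(filled)
```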

* Recommender systems are a subclass of information filtering systems that predict preferences or ratings a user would give to a product.

* Yearly seasonality (ex: Christmas) may overlap with monthly/weekly/daily seasonality. This makes the time series non-stationary, because the mean of the series differs across different time periods.

* SVM and Random Forest are classification algorithms. SVM is a better choice when the data is clean and outlier-free; if not, Random Forest may be able to adapt to it.

* SVM (especially with extensive parameter searches) consumes much more computational power than Random Forest, so Random Forest will be better if you have memory constraints.

* Random Forest is preferred in multiclass problems. SVM is preferred in high-dimensional problems, such as text classification.

* The data should have normally distributed residuals, statistical independence of errors, and linearity.

* Bayesian Estimation models incorporate some knowledge about the data (a prior). There may be several values of the parameters that explain the data, so we can look for multiple parameter settings (say, 5 gammas and 5 lambdas) that do this. As a result, Bayesian Estimation yields multiple models for making predictions (one for each pair of parameters, but with the same prior). If a new example needs to be predicted, computing the weighted sum of these predictions serves the purpose.

* Maximum Likelihood does not take the prior into consideration. It is analogous to a Bayesian model that uses some sort of flat prior.

* Epoch: Represents one run through the entire dataset (everything put into a training model).

* Batch: Because it is computationally expensive to pass the entire dataset into the neural network at once, the dataset is divided into several batches.

* Iteration: The number of batches run through in each epoch. If we have 50,000 data rows and a batch size of 1,000, then each epoch will run 50 iterations.
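
The arithmetic in the example, as a one-liner (using a ceiling so a partial final batch still counts as an iteration):

```python
import math

rows, batch_size = 50_000, 1_000
iterations_per_epoch = math.ceil(rows / batch_size)
print(iterations_per_epoch)   # 50
```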

* Convolutional layer: A layer that performs a convolution operation, sliding small filter windows across the image to produce feature maps that generalize the image.

* Activation layer (usually ReLU): Introduces non-linearity to the network and converts all negative pixels to zero. The output becomes a rectified feature map.

* Pooling Layer: A down-sampling operation that reduces the dimensionality of a feature map.

* A CNN is usually built from several iterations of convolutional, activation, and pooling layers. It can be followed with one or more dense or dropout layers for further generalization, and finished with a fully connected layer.

* A dropout layer reduces overfitting in a neural network. It acts as a mask, randomly preventing connections to certain nodes. This forces each node to carry more information. Dropouts are sometimes used after max-pooling layers.

* The most conventional way: randomly initialize them close to 0. A proper optimizer can take the weights in the right direction.

* If the error space is too steep, it may be difficult for an optimizer to escape a local minimum. Consider initializing several neural networks, each at a different location in the error space.

* Traditional NLP models are trained to predict the next word in a sentence, for example: ‘dog’ in “It’s raining cats and”. Other models may additionally be trained to predict the previous word in a sentence, given the context after it.

* BERT randomly masks a word in the sentence and forces the model to predict that word with both the context before and after it, for example: ‘raining’ in “It’s _____ cats and dogs.”

* This means BERT can detect more complex aspects of language that cannot be predicted by previous context.

* NER (aka entity identification, entity chunking, or entity extraction) is a subtask of information extraction that locates named entities in unstructured text and classifies them into categories such as names, organizations, locations, monetary values, time, etc.

* NER attempts to separate words that are spelled the same but mean different things and to correctly identify entities that may have sub-entities in their name, like ‘America’ in ‘Bank of America’.

* Since tweets are full of hashtags that may carry valuable information, the first step would be to extract hashtags and perhaps create a one-hot encoded set of features.

* The same can be done with @ characters (whichever account the tweet is directed at may be of importance).

* Tweets are also a case of compressed writing (due to the character limit), so there will probably be lots of purposeful misspellings that need to be corrected. The number of misspellings in a tweet might be a helpful feature as well: maybe angry tweets have more misspelled words.

* Removing punctuation, albeit standard in NLP preprocessing, may be skipped in this case because the use of exclamation marks, question marks, periods, etc. may be valuable. There may be three or more columns where the value for each row is the number of exclamation marks, question marks, etc. However, when feeding the data into a model the punctuation should be removed.

* The data would then be lemmatized and tokenized, and there is not just the raw text to feed into the model but also knowledge about hashtags, @s, misspellings, and punctuation, all of which will probably assist accuracy.

* First: convert the paragraphs into numerical form with a vectorizer such as bag of words or TF-IDF. In this case, bag of words may be better, since the corpus (collection of texts) is not very large.

* Second: use cosine similarity or Euclidean distance to compute the similarity between the two vectors.
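
A sketch of the second step, computing cosine similarity between two hypothetical bag-of-words count vectors:

```python
import math

# Hypothetical word-count vectors for two paragraphs over a shared vocabulary.
a = [2, 1, 0, 1]
b = [1, 1, 1, 0]

# Cosine similarity = dot(a, b) / (|a| * |b|).
dot = sum(x * y for x, y in zip(a, b))
cosine = dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
print(round(cosine, 3))
```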

* The formula for Term Frequency is K/T, where K is the number of times the term appears in the document and T is the total number of terms in the document.

* The formula for IDF is the logarithm of the total number of documents divided by the number of documents containing the term: here, log(3/1), or log 3.

* The TF-IDF value for ‘hello’ is therefore K * log(3)/T.
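
Putting the pieces together (the values of K and T are hypothetical counts, chosen only for illustration):

```python
import math

k, t = 4, 100                 # 'hello' appears 4 times in a 100-term document
tf = k / t                    # term frequency: K/T
idf = math.log(3 / 1)         # 3 documents total, 1 contains 'hello'
tfidf = tf * idf              # K/T * log(3)
print(round(tfidf, 4))
```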

* There are generally accepted stop words stored in the NLTK library in Python, but in certain contexts the list should be expanded or trimmed.

* Given a dataset of tweets, the stop word list should be more lenient because each tweet does not have much content to begin with. More information is packed into the brief number of characters, so it may be irresponsible to discard words we deem to be stop words.

* However, given 1K short stories, be stricter with stop words to conserve computing time, and to differentiate more easily between the stories, which will probably all use many stop words several times.

* The P-value is used to determine the significance of results after a hypothesis test in statistics. P-values help the analyst draw conclusions and are always on a scale of 0 to 1.

* A P-value >0.05 denotes weak evidence against the null hypothesis --> the null hypothesis cannot be rejected.

* A P-value <0.05 denotes strong evidence against the null hypothesis --> the null hypothesis can be rejected.

* A P-value =0.05 is the marginal value, indicating it is possible to go either way.

* A ROC curve plots a model’s true positive rate against its false positive rate.

* A completely random prediction yields a straight diagonal line. The optimal model’s curve will be as close to the top-left corner as possible.

* AUC (Area Under the Curve) measures how close the ROC curve is to the top-left corner. A higher AUC indicates better classifier performance.

* Principal Component Analysis is a method of dimensionality reduction: it finds the n orthogonal vectors that capture the most variance in the data, where n is the number of dimensions the user wants the data reduced to.

* PCA can speed up jobs or can be used to visualize high-dimensional data.

* Bias is model error due to an oversimplified ML algorithm, which can lead to underfitting.

* When you train an oversimplified model, it makes simplifying assumptions to make the target function easier to understand.

* Low-bias algos: decision trees, KNN, and SVM.

* High-bias algos: linear and logistic regression.

* Variance is model error due to an overly complex ML algorithm: the model learns noise from the training data set and hence performs badly on test data. It can lead to high sensitivity and overfitting.

* Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias in the model. However, this only happens until a particular point — as you continue to make your model more complex, you end up over-fitting your model.

* Because it accepts a vector of real numbers and returns a probability distribution. Each element is non-negative and the sum over all components is 1.
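
A minimal softmax sketch (subtracting the max is a standard numerical-stability trick, not part of the mathematical definition):

```python
import math

def softmax(xs):
    # Map real numbers to a probability distribution: non-negative, sums to 1.
    m = max(xs)                            # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, -1.0])
print(probs, sum(probs))   # components non-negative, sum is 1
```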

* Term frequency-inverse document frequency reflects how important a word is to a document in a corpus. It is used as a weighting factor in information retrieval and text mining.

* TF–IDF increases proportionally to the number of times a word appears in the document but decreases proportionally by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.

* Sampling bias is a systematic error due to a non-random sampling of a population.

* This causes some members of the population to be less included than others, such as low-income families being excluded from an online poll.

* Time interval bias is when a trial may be terminated early at an extreme value (usually for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all variables have a similar mean.

* Data bias is when specific subsets of data are chosen to support a conclusion, or bad data is rejected on arbitrary grounds instead of according to previously stated or generally agreed-upon criteria.

* Attrition bias is caused by loss of participants: discounting trial subjects that did not run to completion.

Where T is True, F is False, P is Positive, and N is Negative, each denoting the number of items in a confusion matrix.

* Error Rate: (FP + FN) / (P + N)

* Accuracy: (TP + TN) / (P + N)

* Sensitivity/Recall: TP / P

* Specificity: TN / N

* Precision: TP / (TP + FP)

* F-Score: Harmonic mean of precision and recall.
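
The formulas above, computed from hypothetical confusion-matrix counts:

```python
# Hypothetical counts: P = TP + FN actual positives, N = TN + FP actual negatives.
TP, FN, TN, FP = 40, 10, 45, 5
P, N = TP + FN, TN + FP

error_rate = (FP + FN) / (P + N)   # 0.15
accuracy = (TP + TN) / (P + N)     # 0.85
recall = TP / P                    # 0.8
specificity = TN / N               # 0.9
precision = TP / (TP + FP)         # ~0.889
f_score = 2 * precision * recall / (precision + recall)   # harmonic mean

print(accuracy, recall, precision, round(f_score, 3))
```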

* Correlation measures and estimates the relationship between two variables: specifically, how strongly they are related, on a standardized scale from −1 to 1.

* Covariance measures the extent to which two random variables change in tandem.

* A/B testing is hypothesis testing for a randomized experiment with two variables A and B.

* It is effective because it minimizes conscious bias — those in group A do not know that they are in group A, or that there even is a group B, and vice versa.

* However, A/B testing is difficult to perform in any context other than Internet businesses.

* One solution is to roll the die twice. This means there are 6 x 6 = 36 possible outcomes. By excluding one combination (say, 6 and 6), there are 35 possible outcomes.

* Therefore if we assign five combinations of rolls (order does matter!) to one number, we can generate a random number between 1 and 7.

* For instance, say we roll a (1, 2). Since we have (hypothetically) defined the roll combinations (1, 1), (1, 2), (1, 3), (1, 4), and (1, 5) to the number 1, the randomly generated number would be 1.
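The scheme above as a sketch; the helper `rand7` is hypothetical, and the mapping (five consecutive outcome indices per number) is one of many valid choices:

```python
import random

def rand7(roll=lambda: random.randint(1, 6)):
    # Roll a fair die twice, reject (6, 6), and map the remaining
    # 35 ordered outcomes in groups of five onto the numbers 1-7.
    while True:
        a, b = roll(), roll()
        index = (a - 1) * 6 + (b - 1)    # 0..35
        if index < 35:                    # reject only (6, 6)
            return index // 5 + 1         # 1..7, five outcomes each

# Sanity check: every ordered pair except (6, 6) maps to exactly one number,
# and each of 1..7 receives exactly five pairs (so the result is uniform).
counts = {}
for a in range(1, 7):
    for b in range(1, 7):
        if (a, b) != (6, 6):
            n = (a - 1) * 6 + (b - 1)
            counts[n // 5 + 1] = counts.get(n // 5 + 1, 0) + 1
print(counts)
```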

* Univariate analyses are performed on only one variable. Examples: pie charts, distribution plots, and boxplots.

* Bivariate analyses map relationships between two variables. Examples: scatterplots or contour plots, as well as time series forecasting.

* Multivariate analysis deals with more than two variables to understand their effect on a target variable. This can include training neural networks for predictions or using SHAP values/permutation importance to find the most important features. It could also include scatterplots with a third feature mapped to color or size.

* Cross validation measures how well a model generalizes to an entire dataset. A traditional train-test split, in which part of the data is randomly selected as training data and the remaining fraction as test data, may mean that the model performs well on certain randomly selected test fractions and poorly on others.

* In other words, the measured performance is not as indicative of the model’s skill as it is of the randomness of the test data.

* Cross validation splits the data into n segments. The model is trained on n−1 segments and tested on the remaining segment. Then the model is reset and trained on a different set of n−1 segments. This repeats until the model has produced predictions for the entire dataset, and the results are averaged.
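
A sketch of the splitting logic (indices only, assuming the number of rows divides evenly into the folds):

```python
def kfold_indices(n_rows, n_folds):
    # Split row indices into n_folds contiguous segments; each segment
    # serves as the test set exactly once, with the rest as training data.
    fold_size = n_rows // n_folds
    folds = [list(range(i * fold_size, (i + 1) * fold_size)) for i in range(n_folds)]
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx

# Every row is used exactly once as test data across the n folds.
splits = list(kfold_indices(10, 5))
print([test for _, test in splits])
```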

* Naive Bayes is based on Bayes’ Theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to the event. It is considered ‘naive’ because it assumes the features are conditionally independent of one another, an assumption that may or may not be correct. This is why it can be very powerful when used correctly: it bypasses relationships other models must learn because it simply assumes independence holds.

Linear Kernel

Polynomial Kernel

Radial Basis Kernel

Sigmoid Kernel

* Collaborative filtering solely relies on user ratings to determine what a new user might like next. All product attributes are either learned through user interactions or discarded. One example of collaborative filtering is matrix factorization.

* Content filtering relies only on intrinsic attributes of products and customers, such as product price, customer age, etc., to make recommendations. One way to achieve content filtering is to measure similarity between a profile vector and an item vector, such as cosine similarity.

* Hybrid filtering combines content and collaborative filtering recommendations. Which filter to use depends on the real-world context — hybrid filtering may not always be the definitive answer.

* SVM: a partial fit would work. The dataset could be split into several smaller datasets and the model updated incrementally. Because SVM is a low-computational-cost algorithm, it may be the best option in this scenario.

* If the data is not suitable for SVM, a Neural Network with a small batch size could be trained on a compressed NumPy array. NumPy has several tools for compressing large datasets, which are integrated into common neural network packages like Keras/TensorFlow and PyTorch.

If the learning rate is too low, training will progress very slowly, as the weights make only minimal updates. However, if the learning rate is set too high, the loss function may jump erratically due to drastic weight updates. The model may also fail to converge, or may even diverge if the data is too chaotic for the network to train on.
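
Both failure modes can be seen with plain gradient descent on f(x) = x², whose gradient is 2x (the learning rates below are illustrative):

```python
def descend(lr, steps=50, x=1.0):
    # Gradient descent on f(x) = x**2; the minimum is at x = 0.
    for _ in range(steps):
        x = x - lr * 2 * x
    return x

slow_ok = descend(0.1)     # converges toward the minimum at 0
too_high = descend(1.1)    # each update overshoots; |x| grows and diverges
print(abs(slow_ok), abs(too_high))
```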

* A test set is used to evaluate a model’s performance after training.

* A validation set is used during training for parameter selection and to prevent overfitting on the training set.