Data Science Study Cards

Enhance your understanding with these quick-reference data science programming concept cards.



Data Science

An interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.

Python

A popular programming language widely used in data science for its simplicity, readability, and extensive libraries such as NumPy, Pandas, and Matplotlib.

R

A programming language and software environment for statistical computing and graphics, commonly used in data analysis and visualization.

SQL

Structured Query Language, a domain-specific language for defining, querying, and manipulating relational databases, used in data science for data extraction and transformation.
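A minimal sketch of a typical extraction query, using Python's built-in sqlite3 module and a made-up in-memory sales table (the table name and values are purely illustrative):

```python
import sqlite3

# Hypothetical in-memory table for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 100.0), ("south", 250.0), ("north", 50.0)])

# Aggregate sales per region, a common extraction/transformation step.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('north', 150.0), ('south', 250.0)]
conn.close()
```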

Data Cleaning

The process of identifying and correcting or removing errors, inconsistencies, and inaccuracies in datasets to improve data quality and reliability.
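Two of the most common cleaning steps, dropping records with missing values and removing exact duplicates, sketched on a hypothetical dataset using only the standard library:

```python
# Hypothetical raw records with a duplicate and a missing value.
raw = [
    {"name": "Ada", "age": 36},
    {"name": "Ada", "age": 36},      # exact duplicate
    {"name": "Grace", "age": None},  # missing value
    {"name": "Alan", "age": 41},
]

seen = set()
clean = []
for row in raw:
    key = (row["name"], row["age"])
    if row["age"] is None or key in seen:
        continue  # skip incomplete or duplicate records
    seen.add(key)
    clean.append(row)

print(clean)  # [{'name': 'Ada', 'age': 36}, {'name': 'Alan', 'age': 41}]
```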

Data Wrangling

The process of transforming and mapping raw data from various sources into a format suitable for analysis, often involving data cleaning, merging, and reshaping.

Data Visualization

The graphical representation of data to communicate information and insights effectively, using charts, graphs, and other visual elements.

Exploratory Data Analysis

The process of analyzing and summarizing data sets to gain insights, identify patterns, and formulate hypotheses, often using statistical graphics and data visualization techniques.
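A first EDA step is summarizing location and spread. A minimal sketch on a hypothetical sample, using the standard library's statistics module:

```python
import statistics

# Hypothetical sample for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# A compact numeric summary: sample size, center, and spread.
summary = {
    "n": len(data),
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "stdev": round(statistics.pstdev(data), 2),  # population std. dev.
    "min": min(data),
    "max": max(data),
}
print(summary)  # mean 5.0, median 4.5, stdev 2.0
```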

Machine Learning

A branch of artificial intelligence that focuses on the development of algorithms and models that enable computers to learn and make predictions or decisions without explicit programming.

Supervised Learning

A type of machine learning where the model is trained on labeled data, with input-output pairs, to make predictions or classifications on unseen data.
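A minimal sketch of the idea: a 1-nearest-neighbor classifier that "learns" from hypothetical labeled (input, output) pairs and then predicts labels for unseen points:

```python
# Hypothetical labeled training data: (point, label) pairs.
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"),
         ((5.0, 5.0), "b"), ((4.8, 5.1), "b")]

def predict(point):
    # Return the label of the closest training example (squared distance).
    def dist2(p):
        return (p[0] - point[0]) ** 2 + (p[1] - point[1]) ** 2
    return min(train, key=lambda pair: dist2(pair[0]))[1]

print(predict((1.1, 0.9)))  # 'a'
print(predict((5.1, 4.9)))  # 'b'
```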

Unsupervised Learning

A type of machine learning where the model is trained on unlabeled data, without specific output labels, to discover patterns, relationships, or structures in the data.

Deep Learning

A subfield of machine learning that focuses on artificial neural networks with multiple layers, capable of learning hierarchical representations of data for complex tasks.

Neural Networks

A computational model inspired by the structure and function of the human brain, consisting of interconnected nodes (neurons) that process and transmit information.
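The forward pass of a single artificial neuron can be sketched in a few lines: a weighted sum of inputs plus a bias, passed through an activation function. The weights below are arbitrary made-up values, not trained ones:

```python
import math

def neuron(inputs, weights, bias):
    # Weighted sum of inputs, then a sigmoid activation squashing to (0, 1).
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fixed weights; real networks learn these during training.
out = neuron([1.0, 0.5], weights=[0.4, -0.2], bias=0.1)
print(round(out, 3))  # sigmoid(0.4), about 0.599
```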

Natural Language Processing

A field of study that combines linguistics, computer science, and artificial intelligence to enable computers to understand, interpret, and generate human language.

Big Data

Extremely large and complex data sets that cannot be easily managed, processed, or analyzed using traditional data processing techniques.

Hadoop

An open-source framework that allows distributed processing of large datasets across clusters of computers, providing scalability and fault tolerance for big data applications.

Spark

An open-source cluster computing system that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance, commonly used for big data processing.

Statistical Analysis

The collection, analysis, interpretation, presentation, and organization of data to uncover patterns, relationships, and trends, often using statistical models and techniques.

Regression Analysis

A statistical method for modeling the relationship between a dependent variable and one or more independent variables, used for prediction and inference.
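For simple linear regression (one independent variable), the least-squares fit has a closed form; a sketch on a hypothetical dataset:

```python
# Hypothetical data roughly following y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.1, 9.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope: covariance of x and y divided by variance of x.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
# Intercept: the fitted line passes through the point of means.
a = mean_y - b * mean_x

print(round(a, 2), round(b, 2))  # intercept 1.05, slope 1.99
```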

Classification Algorithms

Machine learning algorithms that assign categorical labels or classes to input data based on patterns and relationships learned from labeled training data.

Clustering Algorithms

Machine learning algorithms that group similar data points together based on their characteristics or proximity, often used for exploratory data analysis and pattern recognition.
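A minimal k-means sketch (k = 2) on hypothetical 1-D data: alternate between assigning points to the nearest centroid and recomputing each centroid as its cluster's mean:

```python
# Hypothetical 1-D data with two obvious groups.
points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
centroids = [points[0], points[-1]]  # naive initialization

for _ in range(10):  # a few iterations suffice on this toy data
    clusters = [[], []]
    for p in points:
        # Assign each point to the closer of the two centroids.
        i = min((0, 1), key=lambda c: abs(p - centroids[c]))
        clusters[i].append(p)
    # Recompute centroids as cluster means.
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)  # [1.5, 8.5]
```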

Ensemble Methods

Techniques that combine multiple machine learning models to improve prediction accuracy and reduce overfitting, such as bagging, boosting, and stacking.
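The simplest ensemble combiner is hard majority voting; a sketch over the outputs of three hypothetical base classifiers:

```python
from collections import Counter

def vote(predictions):
    # Return the most common label among the base models' outputs.
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from three base models, per example.
model_outputs = [
    ["spam", "spam", "ham"],   # example 1: two of three say "spam"
    ["ham", "ham", "spam"],    # example 2: two of three say "ham"
]
print([vote(p) for p in model_outputs])  # ['spam', 'ham']
```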

Feature Engineering

The process of selecting, transforming, and creating new features from raw data to improve the performance and interpretability of machine learning models.

Dimensionality Reduction

The process of reducing the number of input variables or features in a dataset while preserving the important information, often used to overcome the curse of dimensionality.

Time Series Analysis

A statistical technique for analyzing and forecasting time-dependent data, such as stock prices, weather patterns, or sales data, to identify trends, patterns, and seasonality.
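A basic time-series smoother is the moving average, which exposes trend behind short-term noise; a sketch on a hypothetical series:

```python
# Hypothetical time series.
series = [10, 12, 14, 13, 15, 17, 16]

# 3-point moving average: each output is the mean of a sliding window.
window = 3
smoothed = [round(sum(series[i:i + window]) / window, 2)
            for i in range(len(series) - window + 1)]
print(smoothed)  # [12.0, 13.0, 14.0, 15.0, 16.0]
```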

Cross-Validation

A technique for assessing the performance and generalization ability of machine learning models by repeatedly partitioning the data into complementary training and validation sets and averaging the results across the splits.
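A sketch of the k-fold splitting scheme itself: every index lands in the validation fold exactly once across the k splits (sizes and indices here are illustrative):

```python
def kfold(n, k):
    # Split indices 0..n-1 into k (train, validation) pairs.
    folds = []
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i not in val]
        folds.append((train, val))
        start += size
    return folds

splits = kfold(n=6, k=3)
print(splits[0])  # ([2, 3, 4, 5], [0, 1])
```

In practice each split would be used to train the model on `train` and score it on `val`, averaging the k scores.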

Overfitting

A phenomenon in machine learning where a model performs well on the training data but fails to generalize to unseen data, often due to excessive complexity or lack of regularization.

Underfitting

A phenomenon in machine learning where a model is too simple or lacks the capacity to capture the underlying patterns in the data, resulting in poor performance on both training and test data.

Bias-Variance Tradeoff

A fundamental concept in machine learning: reducing a model's bias (error from overly simplistic assumptions) tends to increase its variance (sensitivity to the particular training set), and vice versa, so good generalization requires balancing the two.

Precision and Recall

Evaluation metrics commonly used in classification tasks to measure the model's ability to correctly identify positive instances (precision) and capture all positive instances (recall).
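Both metrics follow directly from counting true positives, false positives, and false negatives; a sketch on hypothetical binary labels (positive class = 1):

```python
# Hypothetical true labels and model predictions.
y_true = [1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)  # of predicted positives, fraction correct
recall = tp / (tp + fn)     # of actual positives, fraction found
print(precision, recall)  # 0.75 0.75
```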

Confusion Matrix

A table that summarizes the performance of a classification model by showing the counts of true positive, true negative, false positive, and false negative predictions.

ROC Curve

Receiver Operating Characteristic curve, a graphical plot that illustrates the performance of a binary classification model at various classification thresholds, showing the tradeoff between true positive rate and false positive rate.

AUC-ROC

Area Under the ROC Curve, a metric that quantifies the overall performance of a binary classification model, representing the probability that a randomly chosen positive instance will be ranked higher than a randomly chosen negative instance.
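That probabilistic interpretation can be computed directly: count, over all (positive, negative) pairs, how often the positive example gets the higher score, with ties counting half. A sketch on hypothetical scores:

```python
def auc(scores, labels):
    # AUC as the fraction of (positive, negative) pairs ranked correctly.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical model scores and true labels.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 1, 0, 1]
print(auc(scores, labels))  # 2 of 3 pairs ranked correctly, about 0.667
```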

Hyperparameter Tuning

The process of selecting the optimal values for the hyperparameters of a machine learning model, often using techniques like grid search, random search, or Bayesian optimization.
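A minimal grid-search sketch: evaluate every hyperparameter combination with a scoring function and keep the best. The `score` function below is a made-up stand-in for "train the model and measure validation accuracy":

```python
from itertools import product

def score(depth, lr):
    # Hypothetical validation-score surface peaking at depth=3, lr=0.1;
    # in practice this would train and evaluate a real model.
    return 1.0 - abs(depth - 3) * 0.1 - abs(lr - 0.1)

grid = {"depth": [1, 3, 5], "lr": [0.01, 0.1, 1.0]}
best = max(product(grid["depth"], grid["lr"]),
           key=lambda combo: score(*combo))
print(best)  # (3, 0.1)
```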

Bias

A systematic error or deviation from the true value in a statistical analysis, often caused by flawed assumptions, faulty data collection, or inappropriate modeling techniques.

Variance

The variability or spread of a model's predictions for different training sets, often caused by the model's sensitivity to small fluctuations in the training data.

Regularization

A technique used to prevent overfitting by adding a penalty term to the loss function, encouraging the model to favor simpler solutions and reduce the impact of noisy or irrelevant features.

Feature Importance

A measure of the contribution or importance of each feature in a machine learning model, often used to identify the most influential variables and understand their impact on the predictions.

Principal Component Analysis

A dimensionality reduction technique that transforms a dataset into a new set of orthogonal variables (principal components) that capture the maximum variance in the data.
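The first principal component can be found with power iteration on the covariance matrix; a standard-library-only sketch on hypothetical 2-D data:

```python
import math

# Hypothetical 2-D data with correlated coordinates.
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2),
        (3.1, 3.0), (2.3, 2.7), (2.0, 1.6), (1.0, 1.1)]

# Center the data on its mean.
n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
centered = [(x - mx, y - my) for x, y in data]

# 2x2 covariance matrix [[cxx, cxy], [cxy, cyy]].
cxx = sum(x * x for x, _ in centered) / n
cyy = sum(y * y for _, y in centered) / n
cxy = sum(x * y for x, y in centered) / n

# Power iteration: repeatedly multiply a vector by the matrix and
# normalize; it converges to the dominant eigenvector, i.e. the first
# principal component (the direction of maximum variance).
v = (1.0, 0.0)
for _ in range(50):
    w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
    norm = math.hypot(*w)
    v = (w[0] / norm, w[1] / norm)

print(v)  # unit vector, roughly (0.70, 0.72) for this data
```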

Support Vector Machines

A supervised learning algorithm that separates data points into different classes by finding the optimal hyperplane that maximizes the margin between the classes.

Decision Trees

A supervised learning algorithm that builds a tree-like model of decisions and their possible consequences, using a hierarchical structure of nodes and branches.

Random Forests

An ensemble learning method that combines many decision trees, each trained on a random subset of the data and features, and averages their predictions to improve accuracy and reduce overfitting.

Gradient Boosting

An ensemble learning method that combines multiple weak prediction models (typically decision trees) to create a strong predictive model, by iteratively correcting the mistakes of previous models.

Recurrent Neural Networks

A type of neural network designed to process sequential data, where the output of each step is fed back as input to the next step, allowing the network to retain information about previous steps.

Long Short-Term Memory

A type of recurrent neural network that addresses the vanishing gradient problem by introducing memory cells and gates to selectively remember or forget information over long sequences.