Data Analysis Process in Analytics / Data Science

This article is based on my understanding of the Wikipedia article on Data Analysis and my experience in Data Science / Analytics / AI / ML – https://en.wikipedia.org/wiki/Data_analysis

Various areas such as Data Mining, Predictive Analytics, Exploratory Data Analysis, Text Analytics, Business Intelligence, Confirmatory Data Analysis and Data Visualization overlap with data analysis.

Before starting your journey of solving an industry, academic or research problem in Data Science / Analytics / AI / ML / Decision Science, a fundamental step where many students & professionals struggle is Data Analysis. In this article, I provide a step-by-step approach to analyzing your data. Jumping straight into programming various algorithms or neural networks on your data can be counterproductive and should be avoided. The initial stage should involve robust data analysis via the steps given below, followed by model building, which can include custom or already proven algorithms or a derivative of some popular models. For each of the points discussed below, I have added information from my industry experience on top of my interpretation of the Wikipedia material, either towards the end of the point or as new points after the interpretations.

Your steps for data analysis should generally be:

  1. Set up your data analysis process at a high level, mapped to your objectives – inspecting data, cleaning it, processing it (which could include dimensionality reduction / feature engineering), transformation, modelling and communicating results. Many forget the functional and feedback loop in this process setup to improve data quality – that must be included too.
  2. The next step is understanding the data in terms of what it is telling us. Data could be quantitative numbers, textual, or a mix of the two, and the treatment for each is different. For quantitative / numerical data, we try to understand whether it is time-series, ranking, part-to-whole, deviation, frequency distribution, correlation, nominal, or geographical / geospatial data. For textual or mixed data we need approaches from text mining, sentiment analysis and natural language processing to get insights such as frequency of words, influential words & sentences by weight, trends, categories, clusters and more (a tiny word-frequency sketch appears just after this list). Most of this article revolves around quantitative or numerical data per se, not textual data; this point gives only a very brief idea of textual data analysis.
  3. Next, apply quantitative techniques to the data: sanity checks, audit / reconciliation of totals via formulas, relationships between data, and checks on whether variables are related in terms of correlation, sufficiency, necessity, etc. I would suggest using RStudio or a similar tool for this step.
  4. After this, we actually perform operations like filtering, sorting, checking ranges and classes, summaries, clusters, relationships, context, extremes, etc. At this stage, exploratory data analysis techniques come in very handy, using libraries that provide graphical representation (the second code sketch after this list illustrates these checks). Excel & Tableau also come in handy here.
  5. Our next step is to check for biases, separate facts from opinions, and identify any numerically incorrect or irrelevant inferences being projected that need correction / improvement. This needs a detailed study of the data from a domain / functional perspective along with statistical analysis. Working with a business / functional consultant in this phase is especially useful.
  6. Some areas we need to take care of include quality of data, quality of measurements, transformation of various variables / observations to log scale or other scales (like the Richter scale for earthquakes), and mapping to objectives and characteristics. This is an intuitive step where visualizing data through various transformations in R / Python / etc. using libraries like ggplot2, Plotly, Matplotlib, etc. helps.
  7. Next comes checking outliers, missing values and randomness, and plotting various charts based on whether the data is categorical or continuous. This is statistical analysis & visualization, where I find R to be most suited.
  8. Building models around our data analysis steps could involve linear and non-linear models, checking values via hypothesis testing, and mapping to algorithms to process, predict, cluster, find trends and so on. Tools like R / Python with libraries such as scikit-learn, NumPy, Pandas, MLR, Caret, Keras, TensorFlow, etc. help here.
  9. While running the models, take care of cross-validation & sensitivity analysis – this can generally be done via options in the model training & testing phase for supervised learning (the third sketch after this list shows cross-validation with scikit-learn).
  10. A feedback loop to circle back and improve data & results, accuracy analysis and improvement, pipeline building, interpretation of results and functional mapping to the domain are additional things to consider on top of the basics in the Wikipedia article. Dimensionality reduction techniques like PCA and SVD also need to be explored in detail, as they are helpful in this analysis (see the final sketch after this list).
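
To make point 2 concrete, here is a minimal, hedged Python sketch of word-frequency analysis on text data. The two example sentences are invented placeholders; in practice you would use a proper NLP library like NLTK or spaCy for tokenization.

```python
from collections import Counter
import re

# A tiny invented corpus; in practice this would come from your dataset
docs = [
    "Shipping was fast and the product quality is great",
    "Poor packaging but a great product and fast delivery",
]

# Crude lowercase tokenization; real work would use NLTK / spaCy tokenizers
tokens = re.findall(r"[a-z']+", " ".join(docs).lower())

# The most frequent words hint at themes worth deeper text mining
print(Counter(tokens).most_common(5))
```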
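
The following sketch illustrates points 3, 4, 6 and 7 with pandas and Matplotlib. It is an assumption-laden example: the file name data.csv and the columns price and category are hypothetical placeholders for your own data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")  # hypothetical file; use your own dataset

# Point 4: inspect structure, ranges, classes and summary statistics
df.info()
print(df.describe())

# Point 3: check relationships between numeric variables via correlation
print(df.corr(numeric_only=True))

# Point 6: a skewed variable often benefits from a log transform
# (log1p handles zero values); 'price' is a placeholder column name
df["log_price"] = np.log1p(df["price"])

# Point 7: missing values per column
print(df.isna().sum())

# Point 7: flag potential outliers with the 1.5 * IQR rule
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)
print(f"{mask.sum()} potential outliers")

# Histogram for a continuous variable, bar chart for a categorical one
df["price"].hist(bins=30)
plt.show()
df["category"].value_counts().plot(kind="bar")
plt.show()
```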
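
For points 8 and 9, here is a small, hedged scikit-learn sketch of model building with cross-validation. The bundled Iris dataset stands in for your own analysed data, and logistic regression is just one illustrative choice of model.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)  # example data; substitute your own

# Hold out a test set before any tuning so the final estimate is honest
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Chain scaling and a simple linear model into one pipeline
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Point 9: 5-fold cross-validation on the training split
scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Final fit and evaluation on the held-out test set
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```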
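
Finally, for the dimensionality reduction mentioned in point 10, a minimal PCA sketch with scikit-learn, again using the Iris data purely as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardise first: PCA directions are sensitive to feature scale
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps enough components for ~95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print("Original dimensions:", X.shape[1])
print("Reduced dimensions:", X_reduced.shape[1])
print("Explained variance ratio:", pca.explained_variance_ratio_)
```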

Additional information on top of what is in Wikipedia article:

  1. Explainable AI / ML – https://en.wikipedia.org/wiki/Explainable_artificial_intelligence
  2. Interpretable ML – https://statmodeling.stat.columbia.edu/2018/10/30/explainable-ml-versus-interpretable-ml/
  3. Tools / languages / products to use: R, Python, Pandas, Numpy, Tableau and so on
  4. EDA – https://en.wikipedia.org/wiki/Exploratory_data_analysis
  5. Which chart to use – https://www.tableau.com/learn/whitepapers/which-chart-or-graph-is-right-for-you
  6. List of charts – https://python-graph-gallery.com/all-charts/
  7. Confirmatory data analysis – https://en.wikipedia.org/wiki/Statistical_hypothesis_testing
  8. Singular Value Decomposition – https://en.wikipedia.org/wiki/Singular_value_decomposition
  9. Dimensionality Reduction – https://en.wikipedia.org/wiki/Dimensionality_reduction

What are we doing in AI / ML / Data Science / Decision Science / Analytics World? – Glossary

Over the last few years I have explored, programmed, worked in, researched and taught Data Science / AI / ML / Analytics / Decision Science with multiple students and many software professionals. Along the way I have collected many keywords that you can google and explore. This will help you keep pace with and learn about things happening in these areas. It's like a glossary of words to search on the internet: a mix of technologies, algorithms, concepts, AI / ML / Information Technology terms, BigData words and so on, in no particular order. I will keep expanding it until it is a relatively exhaustive list.

  • Automatic Machine Learning
  • Transfer Learning
  • Explainable Machine Learning
  • Keras
  • PyTorch
  • MLR
  • R
  • Python
  • Ggplot2
  • Matplotlib
  • MLlib
  • Spark
  • Hadoop
  • Tableau
  • Chatbots
  • Talend
  • MongoDB
  • Neo4j
  • Kafka
  • ELK
  • NoSQL
  • Cassandra
  • AWS SageMaker
  • SVM
  • Decision Trees
  • Regression: Logistic, Multiple, Simple Linear, Polynomial
  • Scikit Learn
  • KNIME
  • BERT
  • NLG
  • NLP
  • Random Forest
  • Hyperparameters
  • Boosting
  • Association rules / mining – Apriori, FP-Growth
  • Data mining
  • OpenCV
  • Self driving cars
  • AI / Memory embedded SOCs, GPUs, TPUs
  • Neural engine chipsets
  • Neural Networks
  • Deep Learning
  • EDA
  • Statistical & Algorithmic modelling
  • Sampling
  • Probability distributions
  • Hypothesis testing
  • Intervals, extrapolation, interpolation
  • Scaling
  • Normalization
  • Agents, search, constraint satisfaction
  • Rules based systems
  • Semantic net
  • Propositional logic
  • Fuzzy reasoning
  • Probabilistic learning
  • First order logic
  • Game theory
  • Pipeline building
  • Ludwig
  • Bayesian belief networks
  • Anaconda Navigator
  • Jupyter
  • Synthetic data
  • Google dataset search
  • Kaggle
  • CNN / RNN / Feed forward / Back propagation / Multi-layer
  • Tensorflow
  • Deepfakes
  • KNN
  • K means clustering
  • Naive Bayes
  • Dimensionality reduction
  • Feature engineering
  • Supervised, unsupervised & reinforcement learning
  • Markov model
  • Time series
  • Categorical & Continuous data
  • Imputation
  • Data analysis
  • Classification / Clustering / Trees / Hyperplane
  • Differential calculus
  • Testing & training data
  • Visualization
  • Missing data treatment
  • Scipy
  • Pandas
  • LightGBM
  • Numpy
  • Dplyr
  • Google Colaboratory
  • PyCharm
  • Plotly
  • Shiny
  • Caret
  • NLTK, Stanford NLP, OpenNLP
  • Artificial intelligence
  • SQL / PLSQL
  • Data warehousing
  • Cognitive computing
  • Coral
  • Arduino
  • Raspberry Pi
  • RTOS
  • DARPA Spectrum Challenge
  • The Hundred-Page Machine Learning Book
  • Equations, Functions, and Graphs
  • Differentiation and Optimization
  • Vectors and Matrices
  • Statistics and Probability
  • Operations management & research
  • Unstructured, semi-structured & structured data
  • Five Vs
  • Descriptive, Predictive & Prescriptive analytics
  • Model accuracy
  • IoT / IIoT
  • Recommendation Systems
  • Real Time Analytics
  • Google Analytics

If you are learning something by googling these topics, feel free to suggest more words to add here. You are welcome to discuss / build on this article as well. Thank you for reading.

Changes in India's education system in the last few years

  • Institutions of Eminence declared – Complete autonomy given to them
  • University status for IIMs, NITs, IIITs, AIIMS, etc. via Institutions of National Importance route
  • Graded autonomy for UGC affiliated institutions – Based on their accreditation score, they can offer online, distance courses and will have autonomy in academics, faculty recruitment, etc.  
  • Graded autonomy for AICTE affiliated institutions – Based on their accreditation score, they can offer online, distance courses and will have autonomy in academics, faculty recruitment, etc.  
  • MCA shortened to two years from three years – It’s now mapped to a standard university master’s degree of two years 
  • Online degrees approved – Degrees like MBA, MCA, PGDM, etc. are being offered online
  • Rationalization of engineering colleges – Colleges with a majority of empty seats are being closed, with no approvals of new applications by colleges for the next few years
  • CGPA system now introduced in almost all universities and colleges
  • Merged single regulator & National Education Policy likely to be finalized in the next few months
  • Executive education programs are getting approvals 
  • Hybrid courses by institutions of eminence & institutes of national importance are starting, like Executive MTech and Executive MBA programs, which can be done alongside your routine job
  • Foreign collaboration with universities & colleges across the world is becoming easier 
  • Deemed universities with high score in accreditation will not require approvals for open & distance learning courses 

Three waves of Analytics – Notes on articles by Prof. Davenport

ANALYTICS 1.0 – Business Intelligence, RDBMS & Data Warehousing

  • Vertical scaling
  • Better results and deeper analysis required more processing power & memory
  • Complex systems
  • Single point of failure
  • Backup was compulsory
  • Storage in RDBMS
  • Transformation into business dimensions and facts in the Data Warehouse
  • Descriptive analytics mainly

ANALYTICS 2.0 – BigData, Hadoop, NoSQL & Spark – In-memory computing

Problems with Analytics 1.0

  • Costly hardware
  • Large amounts of data
  • Unstructured data

Solution

  • BigData
  • Hadoop – Large files
  • NoSQL – Smaller files or lower-volume data
  • Horizontal scaling

Problems with BigData

  • Querying unstructured data
  • Large amounts of data needing real-time processing, not batch processing

Solution

  • PIG
  • HIVE
  • Spark – In-memory computing
  • Predictive analytics mainly

ANALYTICS 3.0 – Edge Computing, Data Rich Organizations, Real Time Analytics & more

Problems with Analytics 2.0

  • Most analysis was retrospective, done on past data
  • Organization-wide data also started getting collected but went unused
  • Real-time data started to flow in large volumes

Solution

  • Data rich organizations
  • Use data from the organization to build products mapped not just to the market but also to the organization itself
  • E.g., differentiated products in manufacturing to compete with mass, economies-of-scale production
  • Edge computing
  • Real time processing
  • Combined data
  • Embedded analytics
  • Data discovery
  • Cross functional teams
  • Moving to Prescriptive & Real-Time analytics

Email me: Neil@TechAndTrain.com

Visit my creations:

  • www.TechAndTrain.com
  • www.QandA.in
  • www.TechTower.in