Essential Python Libraries and AI Project Development

NumPy: Numerical Python

NumPy, which stands for Numerical Python, is a Python library that provides functionality to do the following:

  • Create multidimensional array objects (called ndarrays or NumPy arrays).
  • Provide tools for working with ndarrays.

NumPy Arrays

An array, in general, refers to a named group of homogeneous elements. A NumPy array is simply a grid that contains values of the same type.

import numpy as np

# Build a NumPy array from a Python list
numbers = [1, 2, 3, 4]   # avoid naming the variable 'list', which shadows the built-in type
a1 = np.array(numbers)
print(a1)                # [1 2 3 4]

Although NumPy arrays look similar to Python lists, they offer significant advantages for numerical operations.
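
As a minimal sketch of that advantage (the values are illustrative), arithmetic on a NumPy array is applied element-wise in one step, whereas a plain Python list would need an explicit loop:

import numpy as np

a1 = np.array([1, 2, 3, 4])
print(a1 * 2)        # [2 4 6 8] -- every element doubled at once
print(a1 + a1)       # [2 4 6 8] -- element-wise addition
print(a1.dtype)      # all elements share one type, e.g. int64 on most platforms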

Matplotlib: Data Visualization in Python

Matplotlib is a Python library used for data visualization, an essential component of data science.

import matplotlib.pyplot as plt   # Pyplot is the plotting interface used throughout this section

You can create bar plots, scatter plots, histograms, and many more visualizations with Matplotlib.

Data Visualization

Data visualization refers to the graphical or visual representation of information and data using visual elements like charts, graphs, and maps.

For data visualization in Python, the Matplotlib library’s Pyplot interface is used. Pyplot is a collection of functions within the Matplotlib library that allows users to construct 2D plots easily and interactively.

1. Line Chart

A line chart or line graph displays information as a series of data points called “markers,” connected by straight line segments. With Pyplot, a line chart is created using the plot() function.
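
A minimal line-chart sketch (the data values below are made up purely for illustration):

import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5]
temps = [30, 32, 31, 29, 33]        # illustrative values
plt.plot(days, temps, marker="o")   # marker="o" makes the individual data points visible
plt.xlabel("Day")
plt.ylabel("Temperature")
plt.show()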

2. Bar Chart

A bar chart or bar graph presents categorical data with rectangular bars with heights or lengths proportional to the values they represent. With Pyplot, a bar chart is created using the bar() and barh() functions.
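
A minimal bar-chart sketch with illustrative data:

import matplotlib.pyplot as plt

subjects = ["Maths", "Science", "English"]
marks = [85, 90, 78]        # illustrative values
plt.bar(subjects, marks)    # plt.barh(subjects, marks) would draw horizontal bars instead
plt.show()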

3. Scatter Plot

The scatter plot is similar to a line chart. The major difference is that while a line graph connects the data points with a line, a scatter chart simply plots the data points to show the trend in the data. With Pyplot, a scatter chart is created using the scatter() function.
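
A minimal scatter-plot sketch with illustrative data:

import matplotlib.pyplot as plt

height = [150, 155, 160, 165, 170]   # illustrative values
weight = [50, 53, 58, 62, 68]
plt.scatter(height, weight)          # plots the points only, with no connecting line
plt.show()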

4. Pie Chart

A pie chart is a circular statistical graphic divided into slices to illustrate numerical proportion. With Pyplot, a pie chart is created using the pie() function.
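
A minimal pie-chart sketch with illustrative data:

import matplotlib.pyplot as plt

activities = ["Sleep", "School", "Play", "Other"]
hours = [8, 7, 3, 6]        # illustrative values
plt.pie(hours, labels=activities, autopct="%1.1f%%")   # autopct prints each slice's percentage
plt.show()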

5. Histogram Plot

A histogram is a type of graph that provides a visual interpretation of numerical data by indicating the number of data points that lie within a range of values. With Pyplot, a histogram is created using the hist() function.
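
A minimal histogram sketch with illustrative data:

import matplotlib.pyplot as plt

scores = [45, 52, 55, 58, 62, 65, 67, 70, 72, 75, 80, 85, 90]   # illustrative values
plt.hist(scores, bins=5)    # counts how many scores fall into each of 5 equal-width ranges
plt.show()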

6. Box Plot Chart

A box plot is the visual representation of the statistical five-number summary of a given data set. With Pyplot, a box plot is created using the boxplot() function.
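
A minimal box-plot sketch with illustrative data:

import matplotlib.pyplot as plt

data = [12, 15, 17, 19, 21, 23, 25, 27, 30, 45]   # illustrative values; 45 shows up as an outlier
plt.boxplot(data)           # draws the five-number summary: minimum, Q1, median, Q3, maximum
plt.show()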

Statistical Learning and Data Processing

Data processing mainly aims at transforming data into the form best suited for an application. Many statistical techniques are used for this, and basic, commonly used measures such as the mean, median, mode, and standard deviation each summarize a different aspect of a data set. For text data, normalization techniques such as stemming and lemmatization, described below, are commonly applied.

Stemming algorithms work by cutting off the end or the beginning of a word, taking into account a list of common prefixes and suffixes that can be found in an inflected word.

Lemmatization, on the other hand, takes into consideration the morphological analysis of words. To do so, it is necessary to have detailed dictionaries which the algorithm can look through to link the form back to its lemma.
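
A minimal sketch of the difference using NLTK (the lemmatizer needs the WordNet dictionary data, and the exact downloads required can vary with the NLTK version):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)             # dictionary data used by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))                   # 'studi' -- suffix simply chopped off
print(lemmatizer.lemmatize("studies", pos="v"))  # 'study' -- a valid dictionary word (lemma)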

What is a Dictionary in NLP?

A dictionary in NLP is a list of all the unique words occurring in the corpus. If some words are repeated in different documents, they are all written just once while creating the dictionary.
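
A minimal sketch of building such a dictionary from a small corpus (the two documents below are illustrative):

corpus = ["we love ai", "ai loves data"]

# Each unique word is written just once, even if it repeats across documents
dictionary = sorted(set(" ".join(corpus).split()))
print(dictionary)        # ['ai', 'data', 'love', 'loves', 'we']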

What is Term Frequency?

Term frequency is the frequency of a word in one document. Term frequency can easily be found from the document vector table, as that table mentions the frequency of each word of the vocabulary in each document.
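
A minimal sketch of counting term frequency within a single document (the sentence is illustrative):

from collections import Counter

document = "ai makes machines learn and ai learns from data"
term_frequency = Counter(document.split())   # word -> number of occurrences in this document
print(term_frequency["ai"])                  # 2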

Which Package is Used for NLP in Python?

The Natural Language Toolkit (NLTK) is one of the leading platforms for building Python programs that can work with human language data.
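
A minimal tokenization sketch with NLTK (depending on the NLTK version, the "punkt" or "punkt_tab" tokenizer data must be downloaded first):

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)                 # tokenizer data; newer versions may need "punkt_tab"

print(word_tokenize("AI makes machines learn."))   # ['AI', 'makes', 'machines', 'learn', '.']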

What is a Document Vector Table?

A document vector table is used while implementing the Bag of Words algorithm. In a document vector table, the header row contains the vocabulary of the corpus, and other rows correspond to different documents. If the document contains a particular word, it is represented by 1, and the absence of a word is represented by 0.
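
A sketch of such a table using scikit-learn's CountVectorizer with binary=True, so each cell holds 1 for presence and 0 for absence (the documents are illustrative; older scikit-learn versions use get_feature_names() instead of get_feature_names_out()):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["we love ai", "ai loves data", "we love data"]   # three small documents
vectorizer = CountVectorizer(binary=True)   # 1 = word present in the document, 0 = absent
table = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())   # header row: the vocabulary of the corpus
print(table.toarray())                      # one row per document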

What is TF-IDF?

Term frequency-inverse document frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. The term frequency part is the number of times a word appears in a document divided by the total number of words in that document; every document has its own term frequency. The inverse document frequency part lowers the score of words that appear in many documents of the corpus, so common words score low while rare, informative words score high.
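
A minimal hand computation under the usual textbook definitions (tf = count in the document / words in the document, idf = log(total documents / documents containing the word)); libraries may add smoothing terms:

import math

corpus = ["we love ai", "ai loves data", "we love data"]     # illustrative documents
word = "ai"
doc = corpus[0].split()

tf = doc.count(word) / len(doc)                              # frequency of the word in document 1
docs_with_word = sum(1 for d in corpus if word in d.split())
idf = math.log(len(corpus) / docs_with_word)                 # rarer words get a larger weight
print(tf * idf)                                              # TF-IDF score of 'ai' in document 1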

Does the Vocabulary of a Corpus Remain the Same Before and After Text Normalization?

No, the vocabulary of a corpus does not remain the same before and after text normalization. Reasons are:

  • In normalization, the text is simplified step by step and reduced to a minimal vocabulary, since the machine needs the essence of the text rather than grammatically complete sentences.
  • In normalization, stop words, special characters, and numbers are removed.
  • In stemming, the affixes of words are removed, and the words are converted to their base form.

So, after normalization, we get a reduced vocabulary.

Significance of Converting Text into a Common Case

In text normalization, we take the text through several steps that reduce it to a simpler form. After removing stop words, we convert the whole text into the same case, preferably lowercase. This ensures that the machine does not treat the same word as two different words merely because of a difference in letter case.
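
A minimal sketch of this step (the stop-word list here is illustrative, not a standard one):

text = "We LOVE AI and AI loves us"
stop_words = {"and", "us", "we"}                      # illustrative stop-word list

tokens = [w.lower() for w in text.split()]            # common case: everything in lowercase
tokens = [w for w in tokens if w not in stop_words]   # drop the stop words
print(tokens)                                         # ['love', 'ai', 'ai', 'loves']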

Applications of Natural Language Processing

  • Sentiment Analysis
  • Chatbots & Virtual Assistants
  • Text Classification
  • Text Extraction
  • Machine Translation
  • Text Summarization
  • Market Intelligence
  • Auto-Correct

Plot of Occurrence of Words Versus Their Value

When the occurrence of words in a corpus is plotted against their value, the two are inversely proportional. The words which occur most (like stop words) have negligible value. As the occurrence of words drops, their value rises. Such words are termed rare or valuable words: they occur the least but add the most value to the corpus.

Sustainable Development Goals (SDGs)

1. No Poverty

This is Goal 1 and strives to end poverty in all its forms everywhere by 2030. The goal has a total of seven targets to be achieved.

2. Quality Education

This is Goal 4, which aspires to ensure inclusive and equitable quality education and promote lifelong learning opportunities for all. It has ten targets to achieve.

The Need for an AI Project Cycle

A project cycle is the process of planning, organizing, coordinating, and developing a project through all of its phases, from initial planning through execution to completion and review, in order to achieve predefined objectives.

Our mind draws up a plan for every task we have to accomplish, and that is what makes the task clearer in our mind. Similarly, if we have to develop an AI project, the AI Project Cycle provides us with an appropriate framework that can lead us towards the goal.

The major role of the AI Project Cycle is to divide the development of an AI project into distinct stages so that the development becomes easier and clearly understandable. Each stage has a specific purpose, which helps in obtaining the best possible output efficiently. The cycle has five ordered stages: Problem Scoping, Data Acquisition, Data Exploration, Modeling, and Evaluation.