Understanding PageRank, DGIM Algorithm, and Hadoop Ecosystem
PageRank Concept
Basic PageRank Formula: PR(A) = (1−d) + d × (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)), where T1…Tn are the pages linking to A, C(Ti) is the number of outbound links on Ti, & d is the damping factor.
Teleporting & Damping Factor
‘Teleporting’ in the random surfer model lets the surfer jump to a random page with probability 1−d, avoiding issues like rank sinks & spider traps. The damping factor d represents the probability that a user will continue clicking on links rather than teleporting.
Handling Dead Ends & Spider Traps
Dead ends: Pages with no outbound links, causing ‘leakage’ of PageRank. Spider traps: Pages or groups of pages that only link to each other, trapping the rank within them.
Computational Efficiency
Power Iteration method for large-scale computation.
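A minimal power-iteration sketch in Python, following the formula above; the three-page graph & iteration count are made up purely for illustration:

```python
# Power iteration for PR(A) = (1 - d) + d * sum(PR(T)/C(T)) over a toy graph.
d = 0.85
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # page -> outbound links
pr = {page: 1.0 for page in links}                  # start every page at rank 1

for _ in range(50):                                 # iterate until ranks stabilise
    new_pr = {page: 1 - d for page in links}
    for page, outlinks in links.items():
        share = pr[page] / len(outlinks)            # PR(T)/C(T): rank split over out-links
        for target in outlinks:
            new_pr[target] += d * share
    pr = new_pr

print(pr)
```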
Shingling
Convert documents to sets of shingles (sequences of tokens). Documents with significant shingle overlap are considered similar.
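A rough sketch of shingling & Jaccard similarity; the choice k = 3 & the sample sentences are arbitrary:

```python
# Build k-shingles (here, k consecutive word tokens) & compare two documents.
def shingles(text, k=3):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"
print(jaccard(shingles(doc1), shingles(doc2)))   # large overlap -> similar documents
```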
Min-Hashing
Generate compact signatures for documents that preserve similarity.
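A simplified MinHash sketch; salting a single hash with an index, as done here, is a shortcut for many independent hash functions, & the signature length of 100 is arbitrary:

```python
import hashlib

# Each "hash function" keeps the minimum value it sees over a document's shingles.
def minhash_signature(shingles, num_hashes=100):
    return [
        min(int(hashlib.md5(f"{i}:{s}".encode()).hexdigest(), 16) for s in shingles)
        for i in range(num_hashes)
    ]

# The fraction of positions where two signatures agree estimates Jaccard similarity.
def estimated_similarity(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

s1 = minhash_signature({"the quick brown", "quick brown fox", "brown fox jumps"})
s2 = minhash_signature({"the quick brown", "quick brown fox", "brown fox leaps"})
print(estimated_similarity(s1, s2))   # roughly the true Jaccard similarity (0.5 here)
```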
Locality-Sensitive Hashing (LSH)
Hash similar signatures into the same buckets to identify candidate pairs.
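A banding sketch of LSH over MinHash signatures; the band/row counts & the toy signatures are illustrative:

```python
from collections import defaultdict
from itertools import combinations

# Split each signature into b bands of r rows; documents sharing any band bucket
# become candidate pairs, & only those pairs are compared exactly.
def lsh_candidates(signatures, bands=20, rows=5):
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(doc_id)
    pairs = set()
    for docs in buckets.values():
        pairs.update(combinations(sorted(docs), 2))
    return pairs

# Toy signatures of length bands * rows = 100; near-duplicates agree on most positions.
sigs = {"doc1": [7] * 100, "doc2": [7] * 95 + [9] * 5, "doc3": [3] * 100}
print(lsh_candidates(sigs))   # {('doc1', 'doc2')}
```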
DGIM Algorithm
Used for estimating the number of ‘1’s in the last N bits of a binary stream. Requires only O(log² N) bits of memory per stream, trading a small bounded approximation error for efficiency.
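A simplified DGIM sketch, keeping at most two buckets of each power-of-two size; the window length & random stream are made up for illustration:

```python
import random

class DGIM:
    def __init__(self, window):
        self.window = window
        self.time = 0
        self.buckets = []   # (timestamp of most recent 1, size), newest first

    def add(self, bit):
        self.time += 1
        # drop buckets whose most recent 1 has slid out of the window
        self.buckets = [(t, s) for (t, s) in self.buckets if t > self.time - self.window]
        if bit == 1:
            self.buckets.insert(0, (self.time, 1))
            size = 1
            while True:
                same = [i for i, (_, s) in enumerate(self.buckets) if s == size]
                if len(same) < 3:
                    break
                i, j = same[-2], same[-1]                          # two oldest of this size
                self.buckets[i] = (self.buckets[i][0], size * 2)   # merge, keep newer timestamp
                del self.buckets[j]
                size *= 2

    def estimate(self):
        # sum all bucket sizes, counting only half of the oldest bucket
        if not self.buckets:
            return 0
        *rest, oldest = self.buckets
        return sum(s for _, s in rest) + oldest[1] // 2

dgim = DGIM(window=1000)
stream = [random.choice([0, 1]) for _ in range(5000)]
for bit in stream:
    dgim.add(bit)
print(dgim.estimate(), sum(stream[-1000:]))   # estimate vs. true count of 1s in the window
```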
Use Cases of Approximate Counting
Big Data Processing, Web Analytics, DNA Sequence Analysis, Social Media Analytics, Counting in Binary Streams.
Stream Sampling
Sampling a fixed proportion of a stream; reservoir sampling for fixed-size samples; estimating moments of a stream with the AMS method.
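A minimal reservoir-sampling sketch (Algorithm R); the stream & sample size are placeholders:

```python
import random

# Keep a uniform fixed-size sample of an unbounded stream using O(k) memory.
def reservoir_sample(stream, k):
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = random.randint(0, i)     # item survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(1_000_000), 10))
```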
Stochastic Gradient Descent (SGD) & Online Learning
Model parameters are updated one example at a time as data arrives, which suits stream queries & real-world streaming applications.
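A hedged sketch of online learning with SGD: a one-variable linear model updated per example as it streams past; the learning rate & synthetic data are illustrative choices:

```python
import random

def sgd_online(stream, lr=0.05):
    w, b = 0.0, 0.0
    for x, y in stream:
        err = (w * x + b) - y        # squared-error loss; gradients are 2*err*x & 2*err
        w -= lr * 2 * err * x
        b -= lr * 2 * err
    return w, b

# synthetic stream: y = 3x + 1 with a little noise
xs = [random.random() for _ in range(10_000)]
stream = ((x, 3 * x + 1 + random.gauss(0, 0.1)) for x in xs)
print(sgd_online(stream))            # moves toward (3, 1)
```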
Counting Distinct Elements
Challenges in counting distinct elements in streams due to memory constraints.
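One classic workaround is the Flajolet–Martin idea, sketched roughly below; a single hash function gives a noisy estimate, & real systems average many of them:

```python
import hashlib

def trailing_zeros(n):
    count = 0
    while n and n % 2 == 0:
        n //= 2
        count += 1
    return count

# Hash each element & track the largest number of trailing zero bits r seen;
# 2**r is a constant-memory (but coarse) estimate of the distinct count.
def fm_estimate(stream):
    max_r = 0
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        max_r = max(max_r, trailing_zeros(h))
    return 2 ** max_r

print(fm_estimate(["a", "b", "a", "c", "b", "d"]))   # rough estimate of the 4 distinct items
```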
Privacy Breaches & AI Impact
Numerous instances of privacy breaches involving unauthorized recordings & data misuse.
AI Safety & Security Standards
New standards for AI safety & security, including the sharing of safety test results & the development of government standards.
US Privacy Laws
Detailed examination of laws like HIPAA, COPPA, & FERPA.
Challenges & Solutions in Data Privacy
The ongoing vulnerabilities & threats to data privacy. Strategies & techniques for protecting sensitive information.
Hadoop Ecosystem
Stores & processes vast amounts of data. Components: HDFS (the storage unit), MapReduce (processing), & YARN (resource management). The RDD is the backbone of Spark, which runs on top of the Hadoop ecosystem.
MapReduce
Solves: it avoids moving large amounts of data around by processing the data where it currently resides (data locality). Management: the distribution of data & computation across nodes is handled by the MapReduce framework itself.
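A hedged local sketch of the MapReduce pattern (word count); in a real Hadoop job the mapper & reducer run as separate tasks & the framework handles the shuffle shown in memory here:

```python
from collections import defaultdict

def mapper(line):
    for word in line.split():
        yield word.lower(), 1          # emit (key, value) pairs

def reducer(word, counts):
    return word, sum(counts)           # aggregate all values for one key

lines = ["the quick brown fox", "the lazy dog", "the fox"]
shuffle = defaultdict(list)
for line in lines:
    for word, count in mapper(line):
        shuffle[word].append(count)    # the framework normally groups values by key
print(dict(reducer(w, c) for w, c in shuffle.items()))
```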
Spark
Saves time by keeping data in memory instead of on disk, giving faster execution. Offers APIs in several languages, including Python, Scala, & Java. RDD: Resilient Distributed Dataset, an immutable distributed collection of records.
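An illustrative PySpark word count over an RDD, assuming a local Spark installation; the input lines are placeholders:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount")
lines = sc.parallelize(["the quick brown fox", "the lazy dog", "the fox"])

counts = (lines.flatMap(lambda line: line.split())   # transformations are lazy
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))      # RDDs themselves stay immutable

print(counts.collect())                               # the action triggers execution
sc.stop()
```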
Kafka
Topic: A category or feed name where records are published. Partition: A split of data within a topic for parallel processing. Producer: An entity that sends records to a Kafka topic. Consumer: An entity that reads records from a Kafka topic.
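A hedged sketch using the kafka-python client; the broker address, topic name, & payloads are assumptions for illustration & require a running broker:

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: publishes records to a topic (the key influences which partition is chosen).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clicks", key=b"user-42", value=b"page-view")
producer.flush()

# Consumer: reads records from the same topic, starting from the earliest offset.
consumer = KafkaConsumer("clicks",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for record in consumer:
    print(record.key, record.value)
    break
```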