Understanding PageRank, DGIM Algorithm, and Hadoop Ecosystem
PageRank Concept
Basic PageRank Formula: PR(A) = (1−d) + d × (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)), where T1…Tn are the pages linking to A, C(Ti) is the number of outbound links on Ti, & d is the damping factor.
Teleporting & Damping Factor
‘Teleporting’ in the random surfer model lets the surfer jump to a random page with probability 1−d, avoiding issues like rank sinks & spider traps. The damping factor d represents the probability that a user will continue clicking on links rather than teleporting.
Handling Dead Ends & Spider Traps
Dead ends: Pages with no outbound links, causing ‘leakage’ of PageRank. Spider traps: Pages or groups of pages that only link to each other, trapping the rank within them.
Computational Efficiency
Power Iteration method for large-scale computation.
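A minimal power-iteration sketch in Python, following the formula above; the three-page graph & iteration count are made up purely for illustration:

```python
# Power iteration for PR(A) = (1 - d) + d * sum(PR(T)/C(T)) over a toy graph.
d = 0.85
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # page -> outbound links
pr = {page: 1.0 for page in links}                  # start every page at rank 1

for _ in range(50):                                 # iterate until ranks stabilise
    new_pr = {page: 1 - d for page in links}
    for page, outlinks in links.items():
        share = pr[page] / len(outlinks)            # PR(T)/C(T): rank split over out-links
        for target in outlinks:
            new_pr[target] += d * share
    pr = new_pr

print(pr)
```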
Shingling
Convert documents to sets of shingles (sequences of tokens). Documents with significant shingle overlap are considered similar.
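A rough sketch of shingling & Jaccard similarity; the choice k = 3 & the sample sentences are arbitrary:

```python
# Build k-shingles (here, k consecutive word tokens) & compare two documents.
def shingles(text, k=3):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"
print(jaccard(shingles(doc1), shingles(doc2)))   # large overlap -> similar documents
```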
Min-Hashing
Generate compact signatures for documents that preserve similarity.
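A simplified MinHash sketch; salting a single hash with an index, as done here, is a shortcut for many independent hash functions, & the signature length of 100 is arbitrary:

```python
import hashlib

# Each "hash function" keeps the minimum value it sees over a document's shingles.
def minhash_signature(shingles, num_hashes=100):
    return [
        min(int(hashlib.md5(f"{i}:{s}".encode()).hexdigest(), 16) for s in shingles)
        for i in range(num_hashes)
    ]

# The fraction of positions where two signatures agree estimates Jaccard similarity.
def estimated_similarity(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

s1 = minhash_signature({"the quick brown", "quick brown fox", "brown fox jumps"})
s2 = minhash_signature({"the quick brown", "quick brown fox", "brown fox leaps"})
print(estimated_similarity(s1, s2))   # roughly the true Jaccard similarity (0.5 here)
```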
Locality-Sensitive Hashing (LSH)
Hash similar signatures into the same buckets to identify candidate pairs.
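A banding sketch of LSH over MinHash signatures; the band/row counts & the toy signatures are illustrative:

```python
from collections import defaultdict
from itertools import combinations

# Split each signature into b bands of r rows; documents sharing any band bucket
# become candidate pairs, & only those pairs are compared exactly.
def lsh_candidates(signatures, bands=20, rows=5):
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(doc_id)
    pairs = set()
    for docs in buckets.values():
        pairs.update(combinations(sorted(docs), 2))
    return pairs

# Toy signatures of length bands * rows = 100; near-duplicates agree on most positions.
sigs = {"doc1": [7] * 100, "doc2": [7] * 95 + [9] * 5, "doc3": [3] * 100}
print(lsh_candidates(sigs))   # {('doc1', 'doc2')}
```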
DGIM Algorithm
Used for estimating the number of ‘1’s in the last N bits of a binary stream. Requires only O(log² N) bits of memory per stream, trading a small bounded approximation error for efficiency.
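A simplified DGIM sketch, keeping at most two buckets of each power-of-two size; the window length & random stream are made up for illustration:

```python
import random

class DGIM:
    def __init__(self, window):
        self.window = window
        self.time = 0
        self.buckets = []   # (timestamp of most recent 1, size), newest first

    def add(self, bit):
        self.time += 1
        # drop buckets whose most recent 1 has slid out of the window
        self.buckets = [(t, s) for (t, s) in self.buckets if t > self.time - self.window]
        if bit == 1:
            self.buckets.insert(0, (self.time, 1))
            size = 1
            while True:
                same = [i for i, (_, s) in enumerate(self.buckets) if s == size]
                if len(same) < 3:
                    break
                i, j = same[-2], same[-1]                          # two oldest of this size
                self.buckets[i] = (self.buckets[i][0], size * 2)   # merge, keep newer timestamp
                del self.buckets[j]
                size *= 2

    def estimate(self):
        # sum all bucket sizes, counting only half of the oldest bucket
        if not self.buckets:
            return 0
        *rest, oldest = self.buckets
        return sum(s for _, s in rest) + oldest[1] // 2

dgim = DGIM(window=1000)
stream = [random.choice([0, 1]) for _ in range(5000)]
for bit in stream:
    dgim.add(bit)
print(dgim.estimate(), sum(stream[-1000:]))   # estimate vs. true count of 1s in the window
```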
Use Cases of Approximate Counting
Big Data Processing, Web Analytics, DNA Sequence Analysis, Social Media Analytics, Counting in Binary Streams.
Stream Sampling
Sampling a fixed proportion of a stream; reservoir sampling for fixed-size samples; estimating moments of a stream with the AMS method.
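A minimal reservoir-sampling sketch (Algorithm R); the stream & sample size are placeholders:

```python
import random

# Keep a uniform fixed-size sample of an unbounded stream using O(k) memory.
def reservoir_sample(stream, k):
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = random.randint(0, i)     # item survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(1_000_000), 10))
```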
Stochastic Gradient Descent (SGD) & Online Learning
Model parameters are updated one example at a time as data arrives, which suits stream queries & real-world streaming applications.
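A hedged sketch of online learning with SGD: a one-variable linear model updated per example as it streams past; the learning rate & synthetic data are illustrative choices:

```python
import random

def sgd_online(stream, lr=0.05):
    w, b = 0.0, 0.0
    for x, y in stream:
        err = (w * x + b) - y        # squared-error loss; gradients are 2*err*x & 2*err
        w -= lr * 2 * err * x
        b -= lr * 2 * err
    return w, b

# synthetic stream: y = 3x + 1 with a little noise
xs = [random.random() for _ in range(10_000)]
stream = ((x, 3 * x + 1 + random.gauss(0, 0.1)) for x in xs)
print(sgd_online(stream))            # moves toward (3, 1)
```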
Counting Distinct Elements
Challenges in counting distinct elements in streams due to memory constraints.
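One classic workaround is the Flajolet–Martin idea, sketched roughly below; a single hash function gives a noisy estimate, & real systems average many of them:

```python
import hashlib

def trailing_zeros(n):
    count = 0
    while n and n % 2 == 0:
        n //= 2
        count += 1
    return count

# Hash each element & track the largest number of trailing zero bits r seen;
# 2**r is a constant-memory (but coarse) estimate of the distinct count.
def fm_estimate(stream):
    max_r = 0
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        max_r = max(max_r, trailing_zeros(h))
    return 2 ** max_r

print(fm_estimate(["a", "b", "a", "c", "b", "d"]))   # rough estimate of the 4 distinct items
```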
Privacy Breaches & AI Impact
Numerous instances of privacy breaches involving unauthorized recordings & data misuse.
AI Safety & Security Standards
New standards for AI safety & security, including the sharing of safety test results & the development of government standards.
US Privacy Laws
Detailed examination of laws like HIPAA, COPPA, & FERPA.
Challenges & Solutions in Data Privacy
The ongoing vulnerabilities & threats to data privacy. Strategies & techniques for protecting sensitive information.
Hadoop Ecosystem
Stores & processes vast amounts of data. Components: HDFS (the storage unit), MapReduce (processing), & YARN (resource management). The RDD is the backbone of Spark, which runs on top of the Hadoop ecosystem.
MapReduce
Solves: it avoids moving large amounts of data around by processing the data where it currently resides (data locality). Management: the distribution of data & computation across nodes is handled by the MapReduce framework itself.
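A hedged local sketch of the MapReduce pattern (word count); in a real Hadoop job the mapper & reducer run as separate tasks & the framework handles the shuffle shown in memory here:

```python
from collections import defaultdict

def mapper(line):
    for word in line.split():
        yield word.lower(), 1          # emit (key, value) pairs

def reducer(word, counts):
    return word, sum(counts)           # aggregate all values for one key

lines = ["the quick brown fox", "the lazy dog", "the fox"]
shuffle = defaultdict(list)
for line in lines:
    for word, count in mapper(line):
        shuffle[word].append(count)    # the framework normally groups values by key
print(dict(reducer(w, c) for w, c in shuffle.items()))
```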
Spark
Saves time by keeping data in memory instead of on disk, giving faster execution. Offers APIs in several languages, including Python, Scala, & Java. RDD: Resilient Distributed Dataset, an immutable distributed collection of records.
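An illustrative PySpark word count over an RDD, assuming a local Spark installation; the input lines are placeholders:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount")
lines = sc.parallelize(["the quick brown fox", "the lazy dog", "the fox"])

counts = (lines.flatMap(lambda line: line.split())   # transformations are lazy
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))      # RDDs themselves stay immutable

print(counts.collect())                               # the action triggers execution
sc.stop()
```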
Kafka
Topic: A category or feed name where records are published. Partition: A split of data within a topic for parallel processing. Producer: An entity that sends records to a Kafka topic. Consumer: An entity that reads records from a Kafka topic.
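A hedged sketch using the kafka-python client; the broker address, topic name, & payloads are assumptions for illustration & require a running broker:

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: publishes records to a topic (the key influences which partition is chosen).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clicks", key=b"user-42", value=b"page-view")
producer.flush()

# Consumer: reads records from the same topic, starting from the earliest offset.
consumer = KafkaConsumer("clicks",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for record in consumer:
    print(record.key, record.value)
    break
```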