Challenges and Strategies for Effective Web Data Mining

61) In what ways does the Web pose great challenges for effective and efficient knowledge discovery through data mining?

Answer:

  • The Web is too big for effective data mining. The Web is so large and growing so rapidly that it is difficult to even quantify its size. Because of the sheer size of the Web, it is not feasible to set up a data warehouse to replicate, store, and integrate all of the data on the Web, making data collection and integration a challenge.
  • The Web is too complex. The complexity of a Web page is far greater than that of a page in a traditional text document collection. Web pages lack a unified structure. They contain far more variation in authoring style and content than any set of books, articles, or other traditional text-based documents.
  • The Web is too dynamic. The Web is a highly dynamic information source. Not only does the Web grow rapidly, but its content is constantly being updated. Blogs, news stories, stock market results, weather reports, sports scores, prices, company advertisements, and numerous other types of information are updated regularly on the Web.
  • The Web is not specific to a domain. The Web serves a broad diversity of communities and connects billions of workstations. Web users have very different backgrounds, interests, and usage purposes. Most users may not have good knowledge of the structure of the information network and may not be aware of the heavy cost of a particular search that they perform.
  • The Web has everything. Only a small portion of the information on the Web is truly relevant or useful to someone (or some task). Finding the portion of the Web that is truly relevant to a person and the task being performed is a prominent issue in Web-related research.

62) What is a Web crawler and what function does it serve in a search engine?

Answer: A Web crawler (also called a spider or a Web spider) is a piece of software that systematically browses (crawls through) the World Wide Web for the purpose of finding and fetching Web pages. Often Web crawlers copy all the pages they visit for later processing by other functions of a search engine.
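
As an illustration of the find-and-fetch loop described above, here is a minimal sketch of a breadth-first crawler. It is not how any particular search engine works. The `fetch` callback, the `FAKE_WEB` dictionary, and the `example.test` URLs are all assumptions made so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl starting at seed_url.

    `fetch` is a caller-supplied (hypothetical) function that returns the
    HTML of a URL, or None if the page cannot be retrieved.
    Returns a dict mapping each visited URL to its stored HTML copy.
    """
    stored = {}
    queue = deque([seed_url])
    seen = {seed_url}
    while queue and len(stored) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # unreachable page: nothing to store or follow
        stored[url] = html  # keep a copy for later processing
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return stored


# Tiny in-memory "Web" (illustrative data only).
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a> <a href="/missing">x</a>',
}

pages = crawl("http://example.test/", FAKE_WEB.get)
print(sorted(pages))
```

The stored copies in `pages` stand in for the material that a search engine's other functions would process later.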

63) What is search engine optimization (SEO) and why is it important for organizations that own Web sites?

Answer: Search engine optimization (SEO) is the intentional activity of affecting the visibility of an e-commerce site or a Web site in a search engine’s natural (unpaid or organic) search results. In general, the higher a site ranks on the search results page, and the more frequently it appears in the search results list, the more visitors it will receive from the search engine’s users. Being indexed by search engines like Google, Bing, and Yahoo! is not good enough for businesses. Getting ranked on the most widely used search engines and getting ranked higher than your competitors are what make the difference.

64) What is the difference between white hat and black hat SEO activities?

Answer: An SEO technique is considered white hat if it conforms to the search engines’ guidelines and involves no deception. Because search engine guidelines are not written as a series of rules or commandments, this is an important distinction to note. White-hat SEO is not just about following guidelines, but about ensuring that the content a search engine indexes and subsequently ranks is the same content a user will see. Black-hat SEO attempts to improve rankings in ways that the search engines disapprove of, or that involve deception or trying to divert search engine algorithms from their intended purpose.

65) How would you define clickstream analysis?

Answer: Clickstream analysis is the analysis of information collected by Web servers to help companies understand user behavior better. By using data and text mining techniques, companies can frequently discern interesting patterns from the clickstreams. Data collected from clickstreams include user data, session data, which pages users viewed, and when and how often they visited. Knowledge extracted from clickstreams includes usage patterns, user profiles, page profiles, visit profiles, and customer value.
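
One common first step in working with clickstream data is grouping raw clicks into visits (sessions). The sketch below shows one way this might be done. The record layout, the user IDs, and the 30-minute inactivity cutoff are assumptions for illustration and do not come from the text.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed cutoff, not from the text

# Hypothetical clickstream records: (user_id, timestamp, page)
CLICKS = [
    ("u1", "2024-01-01 09:00", "/home"),
    ("u1", "2024-01-01 09:05", "/products"),
    ("u1", "2024-01-01 11:00", "/home"),
    ("u2", "2024-01-01 09:10", "/home"),
]


def sessionize(clicks, timeout=SESSION_TIMEOUT):
    """Group each user's clicks into sessions split by inactivity gaps."""
    by_user = defaultdict(list)
    for user, stamp, page in clicks:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        by_user[user].append((when, page))

    sessions = defaultdict(list)
    for user, events in by_user.items():
        events.sort()  # order each user's clicks by time
        current = [events[0]]
        for prev, event in zip(events, events[1:]):
            if event[0] - prev[0] > timeout:
                # Gap too long: close the current session, start a new one.
                sessions[user].append(current)
                current = []
            current.append(event)
        sessions[user].append(current)
    return dict(sessions)


sessions = sessionize(CLICKS)
visits_per_user = {user: len(s) for user, s in sessions.items()}
print(visits_per_user)
```

Here `visits_per_user` is a simple visit profile: how often each user came to the site.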

66) Why are the users’ page views and time spent on your Web site important metrics?

Answer: If people come to your Web site and don’t view many pages, that is undesirable, and your Web site may have issues with its design or structure. Another explanation for low page views is a disconnect between the marketing messages that brought visitors to the site and the content that is actually available. Generally, the longer a person spends on your Web site, the better it is. That could mean they’re carefully reviewing your content, utilizing interactive components you have available, and building toward an informed decision to buy, respond, or take the next step you’ve provided. On the other hand, the time on site also needs to be examined against the number of pages viewed to make sure the visitor isn’t spending his or her time trying to locate content that should be more readily accessible.
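
To make the two metrics concrete, the following sketch computes page views and time on site per visit, then relates one to the other. The visit data and names are hypothetical. Measuring time on site from the first to the last recorded page view is a simplifying assumption, so a single-page visit reports zero minutes.

```python
from datetime import datetime

# Hypothetical visits: each is a time-ordered list of (timestamp, page)
VISITS = {
    "visit-1": [("09:00", "/home"), ("09:02", "/pricing"), ("09:10", "/signup")],
    "visit-2": [("10:00", "/home")],
    "visit-3": [("11:00", "/home"), ("11:25", "/search")],
}


def visit_metrics(events):
    """Return (page_views, minutes_on_site) for one visit."""
    times = [datetime.strptime(stamp, "%H:%M") for stamp, _ in events]
    minutes = (times[-1] - times[0]).total_seconds() / 60
    return len(events), minutes


metrics = {name: visit_metrics(events) for name, events in VISITS.items()}

avg_pages = sum(pages for pages, _ in metrics.values()) / len(metrics)
avg_minutes = sum(mins for _, mins in metrics.values()) / len(metrics)

# Time on site examined against pages viewed: a high value may mean the
# visitor is struggling to locate content rather than engaging with it.
minutes_per_page = {name: mins / pages for name, (pages, mins) in metrics.items()}

print(metrics, avg_pages, round(avg_minutes, 1), minutes_per_page)
```

In this made-up data, visit-3 spends far more time per page than visit-1, which is the kind of pattern worth checking against the site's navigation.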

67) How is a conversion defined on an organization’s Web site? Give examples.

Answer: Each organization defines a “conversion” according to its specific marketing objectives. Some Web analytics programs use the term “goal” to benchmark certain Web site objectives, such as a certain number of visitors to a page, a completed registration form, or an online purchase.
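
Because each organization defines its own conversions, one way to model them is as named goal tests applied to each visit. The sketch below assumes a hypothetical visit record layout and two illustrative goals. It also reports a conversion rate (visits reaching the goal divided by total visits), which is a common convention rather than something stated above.

```python
# Hypothetical visit records; field names are illustrative only.
VISITS = [
    {"id": 1, "pages": ["/home", "/register", "/register/done"], "purchase": False},
    {"id": 2, "pages": ["/home"], "purchase": False},
    {"id": 3, "pages": ["/home", "/cart", "/checkout"], "purchase": True},
    {"id": 4, "pages": ["/home", "/pricing"], "purchase": False},
]

# Each goal is a test that says whether a visit counts as a conversion.
GOALS = {
    "registration": lambda visit: "/register/done" in visit["pages"],
    "purchase": lambda visit: visit["purchase"],
}


def conversion_rates(visits, goals):
    """Return the share of visits that reached each goal."""
    total = len(visits)
    if total == 0:
        return {name: 0.0 for name in goals}
    return {
        name: sum(1 for visit in visits if reached(visit)) / total
        for name, reached in goals.items()
    }


rates = conversion_rates(VISITS, GOALS)
print(rates)
```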

68) What is the voice of the customer (VOC) strategy? List and describe its four steps.

Answer: Voice of the customer (VOC) is a term usually used to describe the analytic process of capturing a customer’s expectations, preferences, and aversions. It is essentially a market research technique that produces a detailed set of customer wants and needs, organized into a hierarchical structure, and then prioritized in terms of relative importance and satisfaction with current alternatives. Its four steps are:

  • Listen: Encompasses both the capability to listen to the open Web (forums, blogs, tweets, you name it) and the capability to seamlessly access enterprise information (CRM notes, documents, e-mails, etc.).
  • Analyze: Takes all of the unstructured data and makes sense of it. Solutions include keyword, statistical, and natural language approaches that will allow you to essentially tag or barcode every word and the relationships among words, making it data that can be accessed, searched, routed, counted, analyzed, charted, reported on, and even reused.
  • Relate: After finding insights and analyzing unstructured data, you connect those insights to your “structured” data about your customers, products, parts, locations, and so on.
  • Act: In this step, you act on the new customer insight you’ve obtained.

69) What are the three categories of social media analytics technologies and what do they do?

Answer:

  • Descriptive analytics: Uses simple statistics to identify activity characteristics and trends, such as how many followers you have, how many reviews were generated on Facebook, and which channels are being used most often.
  • Social network analysis: Follows the links between friends, fans, and followers to identify connections of influence as well as the biggest sources of influence.
  • Advanced analytics: Includes predictive analytics and text analytics that examine the content in online conversations to identify themes, sentiments, and connections that would not be revealed by casual surveillance.
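
As a small illustration of the social network analysis category, the sketch below follows "who follows whom" links and ranks accounts by follower count. The account names and edges are made up, and follower count (in-degree) is only one of several possible influence measures, chosen here for simplicity.

```python
from collections import Counter

# Hypothetical "follows" edges: (follower, followed)
FOLLOWS = [
    ("ann", "dana"),
    ("bob", "dana"),
    ("cy", "dana"),
    ("dana", "eve"),
    ("ann", "eve"),
    ("bob", "ann"),
]


def rank_by_followers(edges):
    """Rank accounts by how many distinct edges point at them."""
    in_degree = Counter(followed for _, followed in edges)
    return in_degree.most_common()  # highest follower count first


ranking = rank_by_followers(FOLLOWS)
print(ranking)
```

The same count is also a descriptive statistic (how many followers an account has). The network view adds who those followers are and how they connect.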

70) In social network analysis, who are your most powerful influencers and why are they important?

Answer: Your most powerful influencers are the ones who influence the whole realm of conversation about your topic. They are important because they shape that conversation, so you need to understand whether they are saying nice things, expressing support, or simply making observations or critiquing. What is the nature of their conversations? How is your brand being positioned relative to the competition in that space?