Chapter 3. KnowledgeDex - Secure Enterprise Search

Affordable Enterprise Search for the Rest of Us

Description. Provides crawling technology that can index silos of information across security domains into a single full-text index.

Also provides the user-side search tools that give on-the-fly results, full web-page results, and advanced search criteria.

A search result set is generated in a two-step process: 1) a set of all documents meeting the criteria is built on the server; 2) the result set is filtered against the user's access rights, producing the subset of documents that the user has rights to see. This final set is passed back to the user.
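
A minimal sketch of that two-step flow, with hypothetical searchAll and canRead calls standing in for the server-side pieces:

    import java.util.List;
    import java.util.stream.Collectors;

    // Step 1: build the full set of matching documents on the server.
    // Step 2: keep only the documents this user has rights to see.
    public class TwoStepSearch {
        interface Server {
            List<String> searchAll(String criteria);        // assumed full-index query
            boolean canRead(String userId, String docId);   // assumed rights test
        }

        public static List<String> search(Server server, String userId, String criteria) {
            return server.searchAll(criteria).stream()             // step 1
                         .filter(d -> server.canRead(userId, d))   // step 2
                         .collect(Collectors.toList());
        }
    }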

Throughout this chapter, a Search Document is the unit of indexing and retrieval: a block of extracted text plus metadata, such as the URL of its source. The concept is covered in detail in Section 3.3.1.1, "The Search Document".

Solves. Brings a corporation's confidential knowledge assets, frequently referred to as Information Silos and each protected by its own security domain, under the same full-text search technology that has changed the world. Results are tested against the searcher's access rights so that only links the searcher is allowed to see are returned.

Features. KnowledgeDex key features include:

  • The ability to securely search and locate public, private, and shared content across Internet web content, databases, document repositories, and application business objects

  • Excellent search quality, with the most relevant items for a query spanning diverse sources being shown first

  • Highly secure crawling, indexing, and searching

  • Very fast query performance

  • On-the-fly search results as you type criteria

  • Easy to administer and maintain

  • SQI will customise the Crawler for your proprietary applications

  • Can be implemented on an SQI server as a SaaS component or on the client's server

  • Affordable

3.1. Overview

KnowledgeDex, the Enterprise Search core technology, provides secure search across the content of all Knowledge Base Applications. The indexing engine understands the data structures of the Knowledge Base, so only the latest version of content is indexed. The powerful search engine produces relevant, scored results that are filtered by the access permissions of the requester at page- or document-level. The results set shows only what the user has rights to access.

3.1.1. Searching Information Silos

Full-text search technology powers the Internet and has changed the way knowledge is acquired. However, bringing this tool to the knowledge repositories within a corporation is currently very expensive and a burden on internal IT staff. These factors have limited the growth of enterprise search within Corporate America and made it almost non-existent in small and emerging firms. Thus the most powerful technology of the last decade is not available to harvest internal knowledge.

Figure 3.1. Enterprise Search Crosses Security Domains

Figure 3.1, "Enterprise Search Crosses Security Domains" shows the major issue in Enterprise Search - multiple Security Domains. Each corporate application becomes an information silo with its own set of authorised users, passwords, etc. To bring the full knowledge of the firm to bear on client needs, the Enterprise Search must be able to cross all the different security domains and make that process look seamless to the End User.

3.2. End User Interface

The End User Interface to KnowledgeDex is a very familiar Google-like Tool Bar or Full Page view. The Tool Bar view is presented in Section 3.2.1, "On-the-Fly Search Interface". The Full Page interface is presented in Section 3.2.2, "Full Search Page".

3.2.1. On-the-Fly Search Interface

When KnowledgeDex is accessed via a search box component, as in the AppBar shown in Figure 3.2, "KnowledgeDex Search from AppBar", it provides "real-time" incremental results at every pause in criteria input.

Figure 3.2. KnowledgeDex Search from AppBar

Figure 3.3, "On-the-fly results as you enter criteria" shows an example where the End User has entered "knowledge" as the first search term and paused for a moment. The one-second pause triggers an On-the-Fly results set to be generated. The results set shows a "goodness-of-fit" index and the Search Document titles for the ten best fits.

Figure 3.3. On-the-fly results as you enter criteria

At this point the User can scan the results and select a Search Document to look at, or continue refining the search.
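
The pause-triggered behaviour can be expressed as a simple debounce timer. A minimal sketch in Java, with a hypothetical SearchClient standing in for the server call that returns the ten best-scoring titles:

    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.ScheduledFuture;
    import java.util.concurrent.TimeUnit;

    // Debounce keystrokes: every call cancels the pending query and
    // re-schedules it one second out, so the search only fires after
    // the user pauses typing.
    public class OnTheFlySearch {
        private final ScheduledExecutorService timer =
                Executors.newSingleThreadScheduledExecutor();
        private ScheduledFuture<?> pending;
        private final SearchClient client;

        public OnTheFlySearch(SearchClient client) { this.client = client; }

        public synchronized void onKeystroke(String criteria) {
            if (pending != null) pending.cancel(false);
            pending = timer.schedule(
                    () -> show(client.topTen(criteria)),   // ten best-scoring titles
                    1, TimeUnit.SECONDS);
        }

        private void show(List<String> titles) {
            titles.forEach(System.out::println);           // stand-in for the UI update
        }
    }

    interface SearchClient {
        List<String> topTen(String criteria);              // assumed server call
    }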

Figure 3.4, "Next search term produces a new results set on-the-fly" shows the User entering another search term. After a one-second pause a new results set is produced that shows the top ten Search Document fits for the now two-word criteria.

Figure 3.4. Next search term produces a new results set on-the-fly

At any point the User can select to go to a full page search similar to the Google full page (see the next-to-last entry, Perform a full search, in the dropdown shown in Figure 3.4, "Next search term produces a new results set on-the-fly"). This is shown in the next section.

3.2.2. Full Search Page

Figure 3.5, "Full Search Page" shows the End User view of the full page search. This view adds the Information Silo that produced each Search Document. This view also lets the User walk through a large results set page-by-page.

Figure 3.5. Full Search Page

The above example in Figure 3.5, "Full Search Page" shows simple term-only search criteria (see the text box on line two next to the Go button). KnowledgeDex is capable of supporting advanced search criteria such as restricting the search to specific Information Silos. This is the interface that would be used to implement more powerful searches.

3.3. Technology

The KnowledgeDex platform includes three major components:

  • Crawler. This is the "front end" component that accesses the content to be indexed. Section 3.3.1, "Crawler" presents the features and the complexity of accessing enterprise content. The Crawler is fully SQI code.

  • Indexing and Search Engine. This is the text indexing and search technology. It is a major Open Source technology developed and supported by the Apache Software Foundation. SQI configures and hosts this technology. In addition, SQI has added an RPC (Remote Procedure Call) interface. A sketch of the engine at work follows this list.

  • Query and Rights Filter. This is the "back end" component that accepts the End User search criteria, filters the results set for access rights, and presents the ranked "hits" to the End User. Section 3.3.2, "Query Time Authorization Filter" presents the features and capabilities. This component is fully SQI code.
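
As a concrete illustration of the Indexing and Search Engine component, here is a minimal sketch of indexing and querying one Search Document with the Lucene 8 API; the field names, URL, and in-memory directory are illustrative, not the product's actual schema:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class LuceneSketch {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory();        // in-memory index
            StandardAnalyzer analyzer = new StandardAnalyzer();

            // Index one Search Document: its text plus a source URL as metadata.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                Document doc = new Document();
                doc.add(new StringField("url", "http://intranet/kb/101", Field.Store.YES));
                doc.add(new TextField("content", "knowledge base search example", Field.Store.NO));
                writer.addDocument(doc);
            }

            // Query the index; each hit carries a relevance score and its URL.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                ScoreDoc[] hits = searcher.search(
                        new QueryParser("content", analyzer).parse("knowledge"), 10).scoreDocs;
                for (ScoreDoc hit : hits) {
                    System.out.println(hit.score + "  " + searcher.doc(hit.doc).get("url"));
                }
            }
        }
    }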

3.3.1. Crawler

The Crawler is the "front end" technology to the indexing and search platform. The Crawler is responsible for extracting the text to be indexed from each of the information and knowledge silos within the firm.

The Crawler has two major tasks: determining what data sources to visit, and extracting the text content from each source. For example, the Google Crawler, after being given a large number of starting web sites, determines where to go next by following the links on each web page. The Google Crawler also extracts the text from each web page and submits it to the search engine for indexing.
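
A minimal sketch of the follow-the-links pattern, using the jsoup HTML library (an illustrative choice, not necessarily the product's actual parser):

    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    // Fetch one page, hand its text to the indexer, and report the
    // outgoing links that the crawl frontier should visit next.
    public class LinkFollower {
        public static void visit(String url) throws IOException {
            Document page = Jsoup.connect(url).get();
            System.out.println("Index " + page.text().length() + " chars for " + url);
            for (String next : page.select("a[href]").eachAttr("abs:href")) {
                System.out.println("Queue for crawling: " + next);
            }
        }
    }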

Following links works for public web pages, but other information silos such as a corporate document storage system, a support system, or a legacy business application require other processes for determining what content to visit. A common solution is an API into the application that can return all the information units in that application. For example, the Crawler could ask the support system for all the tickets that have changed since the last visit, then visit each ticket via the API and update the search engine with the new content.
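
A minimal sketch of that incremental pattern, with a hypothetical TicketApi standing in for the support system's interface:

    import java.time.Instant;
    import java.util.List;

    // Incremental crawl loop: ask the silo for everything changed
    // since the last visit, then re-index each changed unit.
    public class IncrementalCrawler {
        private Instant lastVisit = Instant.EPOCH;
        private final TicketApi api;        // assumed silo API
        private final Indexer indexer;      // assumed search-engine front end

        public IncrementalCrawler(TicketApi api, Indexer indexer) {
            this.api = api;
            this.indexer = indexer;
        }

        public void crawl() {
            Instant now = Instant.now();
            for (String ticketId : api.ticketsChangedSince(lastVisit)) {
                indexer.update(ticketId, api.fetchText(ticketId));
            }
            lastVisit = now;                // next crawl starts from here
        }
    }

    interface TicketApi {
        List<String> ticketsChangedSince(Instant since);
        String fetchText(String ticketId);
    }

    interface Indexer {
        void update(String docId, String text);
    }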

3.3.1.1. The Search Document

Full-text search engines work on the concept of a Search Document as the owner of a set of text and some metadata (information about the Document). A common example is a web page as a Document. During the indexing process the words on the web page are added to the index and associated with the URL of the page. When a search matches the indexed content of the web page, the associated URL is returned as a match.

A web page is only one type of content that can be mapped to a Search Document. A PDF file out on the web is another very common mapping. In this case the PDF file is processed to extract the text, the text is indexed by the search engine, and the URL of the PDF is associated with the Search Document.
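
A sketch of the PDF mapping using the Apache Tika text extractor (a common choice for this task, not confirmed as the product's actual extractor; the file name and URL are illustrative):

    import java.io.File;
    import org.apache.tika.Tika;

    // Extract the text of a PDF so it can be indexed as a Search Document
    // whose metadata is the file's URL.
    public class PdfToSearchDocument {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();    // detects the format and parses the PDF
            String text = tika.parseToString(new File("whitepaper.pdf"));
            String url = "http://www.example.com/whitepaper.pdf";
            System.out.println("Index " + text.length() + " chars under " + url);
        }
    }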

Other types of data can also be mapped into a Search Document. Examples are support database content, supplier data in corporate supply chain systems, and sales/customer data in legacy systems. Every piece of corporate information the Knowledge Workers may need is a candidate for the Enterprise Search database.

3.3.1.2. Content Smart Crawler

The range and complexity of content structures, described in Section 3.3.1.1, "The Search Document" above, dictate the need for a powerful and flexible Crawler that can extract and index all the knowledge sources in the company.

A Crawler can add significant value to the indexing process in two ways: trimming non-relevant material, and aggregating associated content into a single Search Document. Trimming improves indexing by removing text that could produce a search "hit" in error by matching words outside the real topic content. Aggregation improves indexing by pulling together a stream of content associated with a topic, thus increasing the range of search terms that will produce a search match.


3.3.1.2.1. Trim Non-relevant Material

Web pages, especially public ones, have extensive amounts of material that is not the actual content of the page.

Figure 3.6. Page Content as Small Part of Overall Material

If the non-content text is indexed, future search results can be degraded.
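
A minimal sketch of trimming with the jsoup parser, assuming the page's real content lives in a known container; the CSS selectors here are illustrative:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    // Keep only the main article text; drop navigation, ads, footers, etc.
    public class ContentTrimmer {
        public static String trim(String html) {
            Document page = Jsoup.parse(html);
            page.select("nav, header, footer, aside, script, style").remove();
            // "div.content" is an assumed location for the page's topic text.
            return page.select("div.content").text();
        }

        public static void main(String[] args) {
            String html = "<html><nav>menu</nav>"
                    + "<div class=\"content\">The real topic text.</div>"
                    + "<footer>legal</footer></html>";
            System.out.println(trim(html));   // prints: The real topic text.
        }
    }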

3.3.1.2.2. Aggregate Associated Content into a Single Search Document

In some applications the Search Document that needs to be indexed is an aggregation of content from a number of records. For example, in an email support ticketing system, the Search Document should be the full stream of responses between the participants.

Figure 3.7, "Part of Email Stream on specific Support Issue" shows three emails in a stream of emails related to a specific support issue. In this example the Crawler's task is to create a Search Document that has all the text in the correspondence stream. Thus the Crawler is required to know how to access the set of emails from the support database that are associated with the support issue.

Figure 3.7. Part of Email Stream on specific Support Issue

Pulling the content from all the emails associated with a single support issue into one Search Document will provide significantly better search results, because both the terms used by the end users to initially describe the issue and the technical terms that provided the final solution are in the same Search Document. Thus a query can find the document using either end-user language or support-tech language.
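
A minimal sketch of the aggregation step, with a hypothetical SupportDb interface standing in for the support database API:

    import java.util.List;
    import java.util.stream.Collectors;

    // Build one Search Document from every email attached to a support
    // issue, so the end users' wording and the final technical answer
    // are indexed together.
    public class EmailAggregator {
        interface SupportDb {
            List<String> emailsForIssue(String issueId);   // assumed API
        }

        public static String aggregate(SupportDb db, String issueId) {
            return db.emailsForIssue(issueId).stream()
                     .collect(Collectors.joining("\n---\n"));  // one text block
        }
    }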

3.3.1.3. Access Authorization Super User

The KnowledgeDex Crawler operates at the highest level of access authorization so that it can reach all the enterprise information assets stored in the applications being crawled. Thus, every search has access to all the Search Documents that could provide key information.

However, once the result set is produced it has to be filtered for the User's access rights to each Search Document. This is covered in Section 3.3.2, "Query Time Authorization Filter".

3.3.2. Query Time Authorization Filter

The final server-side process, before a results set is sent to the User, is the filtering of the set for access rights. For each Search Document, the User's access rights are tested via the application that contributed the document. Thus the filtered results set has passed User access security as of the current moment.
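
A minimal sketch of the filter, assuming each Information Silo exposes a hypothetical canRead check that is called at query time (no rights are stored in the index):

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    // Each hit is validated against the application that contributed it,
    // so access rights are tested up to the moment of the query.
    public class QueryTimeFilter {
        interface RightsChecker {
            boolean canRead(String userId, String docId);   // assumed per-silo call
        }

        public record Hit(String silo, String docId, float score) {}

        private final Map<String, RightsChecker> checkersBySilo;

        public QueryTimeFilter(Map<String, RightsChecker> checkersBySilo) {
            this.checkersBySilo = checkersBySilo;
        }

        public List<Hit> filter(String userId, List<Hit> hits) {
            return hits.stream()
                       .filter(h -> checkersBySilo.get(h.silo()).canRead(userId, h.docId()))
                       .collect(Collectors.toList());
        }
    }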

[Note]Note
  • Best search matches across all security domains

  • Validate the right of access for each Search Document

    Uses up-to-the-moment security status. Access rights are never stored in the Search Document, where they could get out of date.

  • Show only validated documents

3.4. Implementation

The core concept of the implementation plan is that SQI will design, implement and support the Enterprise Search infrastructure. At some point the system is completely turned over to the client.

[Note]Note

All of the KnowledgeDex software components are open source code. The Lucene Search Engine is under the Apache Open Source license. All SQI code is under the MIT Open Source license. This means the client is free to use and modify the environment as they see fit.

The implementation is a five-step process.

  1. Feasibility Review. The work product will be a Design Document covering:

    • The structure of the Search Document for each Information Silo.

    • Crawler "shim" design for each Information Silo.

    • Lucene extensions and configuration.

    • Authorization filter "shim" design for each Information Silo.

  2. Search Document Design. Define the structure and content of the Search Document for each Information Silo.

  3. Code Crawler Enhancements. Implement the Crawler "shim" for each Information Silo per the Design Document.

  4. Search Engine Configuration. Apply the Lucene extensions and configuration called for in the Design Document.

  5. Code Authorization Filter. Implement the authorization filter "shim" for each Information Silo.

The Crawler and Authorization Filter shims have to be implemented on the client's IT systems. The Lucene Search Engine, the Crawler, and the Authorization Filter run on a Linux-based system that can be:

  • Hosted. Implemented on a Linux Virtual Machine (VM) in the SQI data center. This environment can be re-implemented on a client system later.

  • Managed. Implemented on a dedicated Linux system in the SQI data center. This can easily be moved to the client's site later.

  • On-site. Implemented on a dedicated Linux system at the client's site by SQI. The system is managed and maintained by SQI until hand-off.

3.5. Reference Material

The Google and Oracle references provide a competitive and problem framework for the enterprise search market segment. They also provide an excellent overview of the features and language used to address this market. The Apache Lucene reference presents the underlying technology of KnowledgeDex.

  • Google Search Appliance web page. This page presents introductory material for the Google turnkey system. Three videos provide an excellent overview of what enterprise search encompasses. Also of major importance is the starting price of $30,000 for the Google solution.

  • Oracle Secure Enterprise Search 10g web page. Oracle Secure Enterprise Search (SES) 10g, a standalone product from Oracle, enables secure, high-quality, easy-to-use search across all enterprise information assets.

    This home page for Oracle Secure Enterprise Search has a wealth of marketing and technical material. Although the material is stated in SES terms, the majority is applicable to the general Enterprise Search market segment.

  • Apache Lucene Overview web page. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
