Data governance is an important step when using a knowledge retrieval platform like Genow. Features such as the feedback dashboard and quality insights help you with data governance while you use the platform.
In this article, you will find out which data types we support and how to improve your data quality.
The Genow platform’s ability to extract information from documents and data depends on the quality of the underlying data and documents. For this reason, a good data structure must be created. This documentation provides basic guidelines and shows which data can be extracted.
Even after the test phase, use case admins should maintain and supplement data. Users can provide feedback on missing or incorrect information.

How Data Is Extracted

When data is first synchronized, it is processed to make it available for generating responses.
  • During this process, data and information from files are analyzed and interpreted to understand their structure and convert them into a suitable format (technical term: parsing). We combine various methodologies, some developed in-house and others from external service providers like Google.
  • We use vectorization. This technology ‘understands’ the meaning of your text content and translates it into unique digital codes (‘addresses’). When you start a search or ask a question, our system can immediately find the most relevant sections in your knowledge base or documents, without manual searching.
Note: Processing data incurs costs.
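The lookup idea described above can be illustrated with a minimal sketch. It is not Genow's implementation: the `embed` function is a toy word-count vector standing in for a learned embedding model, and the names `embed`, `cosine`, `search`, and the sample sections are assumptions made for illustration only. A real embedding model captures meaning beyond shared words, which this toy version does not.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a word-count vector (stand-in for a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, sections: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k sections whose vectors are closest to the query vector."""
    q = embed(query)
    ranked = sorted(sections, key=lambda s: cosine(q, embed(s)), reverse=True)
    return ranked[:top_k]


# Hypothetical knowledge-base sections.
sections = [
    "Vacation requests must be submitted two weeks in advance.",
    "The VPN client is installed from the software portal.",
    "Travel expenses are reimbursed within 30 days.",
]
best = search("How do I install the VPN client?", sections)
```

In this sketch, every section is turned into a vector once, and each question is answered by comparing its vector against the stored ones, which is why no manual searching is needed.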

Which Data Sources Are Supported?

We support automated data extraction through data synchronisation from:
  • SharePoint
  • Jira
  • Confluence
  • Google Cloud Buckets
  • Coming soon: Google Drive and ServiceNow
Is your data stored in an external system? No problem! We can develop an individual connector. Contact us! Find out about options to further develop and individualise your use case here.

Special Notes for Pipeline Optimisations

There are various optimisation options for data pipelines which, if implemented correctly, can lead to quality improvements and higher user satisfaction. A brief explanation of selected optimisation options and settings is available under Use Case Optimisations and Settings.

Data Maintenance - General Guidelines

| Area | Good (Do) | Bad (Don’t) | Why Important for AI? |
| --- | --- | --- | --- |
| Structure | Logical folder structure, clear hierarchy | Too deep/flat structures, vague folder names (“Miscellaneous”) | Context understanding, faster finding of relevant data |
| Filenames | Descriptive, meaningful, consistent | Nondescript (Doc1.pdf), unclear (final_final.docx) | Content indication, relevance assessment |
| Currency | Remove/archive/label outdated content, regular review | Keep many old versions in working area | Avoid outdated responses |
| Relevance | Only relevant data in access area, archive/remove irrelevant | Mix private/irrelevant files | Faster search, avoid inappropriate results, unnecessary costs |
| Duplicates | Central storage, use links instead of copies | Store same file in multiple locations | Uniqueness, consistency of responses |
| Maintenance | Regular process, continuous improvement | One-time action, then neglect | Long-term maintenance of data quality and AI performance |
Note: The fewer irrelevant pieces of information in the documents, the more efficient Genow’s response output will be. Please optimize your data using this guideline.
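Two of the guidelines above (filenames and duplicates) can be checked locally before a sync. The following is a minimal sketch under stated assumptions: the `VAGUE_NAME` pattern is a hypothetical heuristic for names like Doc1.pdf or final_final.docx, and `vague_filenames` and `duplicate_files` are illustrative helpers, not part of the Genow platform.

```python
import hashlib
import re
from collections import defaultdict
from pathlib import Path

# Hypothetical heuristic for nondescript names such as "Doc1" or "final_final".
VAGUE_NAME = re.compile(
    r"^(doc|document|scan|new|untitled|final)[\d_\- ]*(final)?[\d_\- ]*$",
    re.IGNORECASE,
)


def vague_filenames(root: Path) -> list[Path]:
    """List files whose name (without extension) matches the vague-name heuristic."""
    return sorted(p for p in root.rglob("*") if p.is_file() and VAGUE_NAME.match(p.stem))


def duplicate_files(root: Path) -> list[list[Path]]:
    """Group files with identical content (same SHA-256 hash) stored in several places."""
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for p in root.rglob("*"):
        if p.is_file():
            by_hash[hashlib.sha256(p.read_bytes()).hexdigest()].append(p)
    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

For example, `vague_filenames(Path("export"))` would list candidates for renaming, and `duplicate_files(Path("export"))` would list groups of identical files that could be replaced by a single central copy. The folder name `export` is a placeholder.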

Supported Data Types

Supported File Formats

| File Type | Detected Elements | Limitations and Recommendations |
| --- | --- | --- |
| HTML | Paragraph, table, list, title, heading, header, footer | Parsing relies heavily on HTML tags. CSS-based formatting may not be captured. |
| PDF | Paragraph, table, title, heading, header, footer | Tables spanning multiple pages may be split into two tables. Generally, the limit is no more than 500 pages and 40 MB in size, ideally no more than 20 MB. However, during the sync process, PDFs that are too large are split into sub-documents automatically. We recommend adding data in the appropriate place rather than using footnotes. |
| DOCX | Paragraphs, tables across multiple pages, lists, titles, headings | Nested tables are not supported. Regarding FAQ documents: if your document contains many different topics in very small text sections, consider using an (Excel) table instead of running text. |
| PPTX | Paragraphs, tables, lists, titles, headings | Headings must be properly marked in the PowerPoint file. Nested tables and hidden slides are not supported. Speaker notes are not read in PPTX format, but can be included when converting to PDF. |
| XLSX/XLSM | Tables in Excel sheets with INT, FLOAT, and STRING values | Multiple table detection is not supported. Hidden tables, rows, or columns may affect detection. Additionally, headers must exist and be marked; references, scripts, etc. are not recognized. For multiple tables in one sheet, please create separate Excel files. |
Additions:
  • We also support SharePoint Pages and Sites. Please note that you can only select or deselect them all. SharePoint Forms and Lists are not supported.
  • Images in PDFs and Word files: Images are supported and understood. Limitations exist if the quality of the images is low or their complexity is high. A quick way to test this is to upload the file to the personal space and ask a specific question that can only be answered from information in specific images.

Additional Limitations

  • Maximum input file size for all file types: 20 MB
  • We recommend converting files to PDF, as this improves readability, storage size, and loading time in the user UI. This is, however, not mandatory.
  • Hyperlinks are often not recognized. If you want hyperlinks to be available, please write the links out in full.
  • Tables in documents: use a standard table layout and provide legends where possible. This matters most for tables with special layouts or designs.
  • SharePoint data cannot be selected on a document level. Place single items you want to select in folders.
  • SharePoint Pages and DriveItems can be selected either all or none for a SharePoint site.
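The file type and size limits listed above can be checked before a sync with a short script. This is a minimal sketch, not a Genow tool: `check_file` and `check_folder` are hypothetical helpers, the extension list is derived from the table of supported file formats, and treating 1 MB as 1024 * 1024 bytes is an assumption. Since the PDF row notes that oversized PDFs are split automatically during sync, a size flag on a PDF is a hint rather than a hard failure.

```python
from pathlib import Path

# Extensions derived from the supported file formats table.
SUPPORTED_EXTENSIONS = {".html", ".pdf", ".docx", ".pptx", ".xlsx", ".xlsm"}
# Assumption: 1 MB is counted as 1024 * 1024 bytes.
MAX_BYTES = 20 * 1024 * 1024


def check_file(name: str, size_bytes: int) -> list[str]:
    """Return a list of problems for one file; an empty list means no issue found."""
    problems = []
    if Path(name).suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append("unsupported file format")
    if size_bytes > MAX_BYTES:
        problems.append("larger than 20 MB")
    return problems


def check_folder(root: Path) -> dict[Path, list[str]]:
    """Check every file below root and report only the files with problems."""
    report = {}
    for p in root.rglob("*"):
        if p.is_file():
            problems = check_file(p.name, p.stat().st_size)
            if problems:
                report[p] = problems
    return report
```

Calling `check_file("demo.mp4", 30 * 1024 * 1024)` reports both problems, while `check_file("Handbook.PDF", 1000)` returns an empty list because the extension check ignores case.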

Unsupported File Formats - Action Items:

  • Videos: We recommend converting to transcripts with the most relevant content. Relevant presentation slides can be captured separately or included in the PDF file with the transcript.
  • Program code (in JSON files or similar formats): Cannot be processed yet. Individual code snippets can be inserted directly into the chat and processed by language models. However, entire codebases or files cannot be processed.
  • Audio files: We recommend converting the most relevant parts to transcripts.
  • Other file formats not listed in the table above are not compatible. Feel free to ask us for recommendations or feature requests.

Special Considerations for the Personal Space

As a use case admin, you can allow users to combine their personal assets with the use case knowledge. A personal asset is visible only to the user who created it.
Personal Assets have additional guidelines regarding processed file sizes. The size of information in a Personal Asset is specified in “chunks”. Chunks are text segments that allow our AI model to efficiently access and use relevant information from large documents. The actual number of pages that can be processed depends on text size, layout, etc. A Personal Asset has a maximum size of 1000 chunks of data, which corresponds to approximately 1000-2000 pages of text.
All supported file types listed in the table above are also supported in the personal space, and the same format requirements apply. If you have a Personal Asset that might be interesting for others as well, you can propose converting it to an actual use case, where there are close to no limitations on storage size.
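The chunk limit can be turned into a rough page estimate. The sketch below assumes only what the figures above imply, namely that one chunk covers roughly one to two pages (1000 chunks for about 1000-2000 pages). The helper names `estimate_chunks` and `fits_personal_asset` are hypothetical, and the actual chunk count depends on text size and layout.

```python
import math

MAX_CHUNKS = 1000
# Implied by "1000 chunks ≈ 1000-2000 pages": one chunk covers about 1 to 2 pages.
MIN_PAGES_PER_CHUNK = 1.0
MAX_PAGES_PER_CHUNK = 2.0


def estimate_chunks(pages: int) -> tuple[int, int]:
    """Return a (low, high) estimate of the number of chunks for a page count."""
    low = math.ceil(pages / MAX_PAGES_PER_CHUNK)
    high = math.ceil(pages / MIN_PAGES_PER_CHUNK)
    return low, high


def fits_personal_asset(pages: int) -> str:
    """Classify a page count against the 1000-chunk limit."""
    low, high = estimate_chunks(pages)
    if high <= MAX_CHUNKS:
        return "fits"
    if low > MAX_CHUNKS:
        return "too large"
    return "borderline"
```

Under these assumptions, a 300-page collection is estimated at 150 to 300 chunks and fits, 1500 pages is borderline, and 2500 pages is likely too large for a single Personal Asset.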