Data governance is an important step when using a knowledge retrieval platform like Genow. We offer features such as the feedback dashboard and quality insights to help you with data governance while you use the platform. In this article you will find out which data types we support and how to improve your data quality.
The Genow platform’s ability to extract information from documents and data depends on the quality of the underlying data and documents. For this reason, a good data structure must be created. This documentation provides basic guidelines and shows which data can be extracted.
How Data Is Extracted
When data is first synchronized, it is processed to make it available for generating responses.
- During this process, data and information from files are analyzed and interpreted to understand their structure and convert them into a suitable format (technical term: parsing). We combine various methodologies, some developed in-house and others from external service providers like Google.
- We use vectorization technology. This technology ‘understands’ the meaning of your text content and translates it into unique digital codes (‘addresses’). When you start a search or ask a question, our system can immediately find the most relevant sections in your knowledge base or documents, without manual searching.
Processing data incurs costs. Find out more here:
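The following is a toy sketch of the idea behind vector-based search, not Genow's actual implementation. It assumes a simple word-count vector in place of a learned embedding model, so it only matches shared words rather than meaning. The names `embed`, `cosine`, and `search` and the sample sections are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, sections: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k sections whose vectors are closest to the query vector."""
    q = embed(query)
    ranked = sorted(sections, key=lambda s: cosine(q, embed(s)), reverse=True)
    return ranked[:top_k]


# Hypothetical knowledge-base sections.
sections = [
    "Vacation requests must be submitted two weeks in advance.",
    "The VPN client is installed from the software center.",
    "Travel expenses are reimbursed within 30 days.",
]
print(search("How do I install the VPN client?", sections))
```

In this sketch the query and the VPN section share the words "the", "vpn", and "client", so that section ranks first. A real embedding model would also match paraphrases that share no words at all.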
Which Data Sources Are Supported?
We support automated data extraction with data synchronisation from:
- SharePoint
- Jira
- Confluence
- Google Cloud Buckets
- Coming soon: Google Drive and ServiceNow
Is your data stored in another external system? No problem! We can develop an individual connector. Contact us! Find out about options to further develop and individualise your use case here.
Special Notes for Pipeline Optimisations
There are various optimisation options for data pipelines which, if implemented correctly, can lead to quality improvements and higher user satisfaction. You can find a brief explanation of selected optimisation options, and further information about Use Case Optimisations and Settings, here.
Data Maintenance - General Guidelines
| Area | Good (Do) | Bad (Don’t) | Why Important for AI? |
|---|---|---|---|
| Structure | Logical folder structure, clear hierarchy | Too deep/flat structures, vague folder names (“Miscellaneous”) | Context understanding, faster finding of relevant data |
| Filenames | Descriptive, meaningful, consistent | Nondescript (Doc1.pdf), unclear (final_final.docx) | Content indication, relevance assessment |
| Currency | Remove/archive/label outdated content, regular review | Keep many old versions in working area | Avoid outdated responses |
| Relevance | Only relevant data in access area, archive/remove irrelevant | Mix private/irrelevant files | Faster search, avoid inappropriate results, unnecessary costs |
| Duplicates | Central storage, use links instead of copies | Store same file in multiple locations | Uniqueness, consistency of responses |
| Maintenance | Regular process, continuous improvement | One-time action, then neglect | Long-term maintenance of data quality and AI performance |
Note: The less irrelevant information your documents contain, the more efficient Genow’s response output will be. Please optimize your data using this guideline.
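Two of the guidelines above, descriptive filenames and no duplicate copies, can be checked with a small script before synchronizing a folder. The following is a minimal sketch and not a Genow feature. The `VAGUE_NAME` pattern and the helper names `is_vague` and `find_duplicates` are assumptions for illustration and should be adapted to your own naming conventions.

```python
import hashlib
import re
import tempfile
from collections import defaultdict
from pathlib import Path

# Hypothetical pattern for nondescript names such as "Doc1" or "final_final".
VAGUE_NAME = re.compile(
    r"^(doc|document|file|new|untitled|scan)?[\s_-]*\d*$|final[\s_-]*final",
    re.IGNORECASE,
)


def is_vague(name: str) -> bool:
    """True if the filename (without extension) looks nondescript."""
    return bool(VAGUE_NAME.search(Path(name).stem))


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group files under root that have byte-identical content."""
    by_hash = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]


# Demo on a throwaway folder with one vague name and one duplicated file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "hr").mkdir()
    (root / "hr" / "Travel_Policy_2024.pdf").write_bytes(b"policy")
    (root / "Doc1.pdf").write_bytes(b"policy")  # same content, vague name
    vague = [p.name for p in root.rglob("*") if p.is_file() and is_vague(p.name)]
    duplicate_groups = [[p.name for p in group] for group in find_duplicates(root)]

print(vague)
print(duplicate_groups)
```

Hashing file contents only finds exact copies. Near-duplicates, such as an old and a slightly revised version, still need a manual review.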
Supported Data Types
Supported File Formats
| File Type | Detected Elements | Limitations and Recommendations |
|---|---|---|
| HTML | Paragraph, table, list, title, heading, header, footer | Parsing heavily relies on HTML tags. CSS-based formatting may not be captured. |
| PDF | Paragraph, table, title, heading, header, footer | Tables spanning multiple pages may be split into two tables. Generally, the limit is no more than 500 pages and 40 MB in size, ideally no more than 20 MB. However, during the sync process, PDFs that exceed these limits are split into sub-documents automatically. We recommend adding information in the appropriate place in the text rather than in footnotes. |
| DOCX | Paragraphs, tables across multiple pages, lists, titles, headings | Nested tables are not supported. Regarding FAQ documents: if your document contains many different topics in very small text sections, you may want to consider using an (Excel) table instead of continuous text. |
| PPTX | Paragraphs, tables, lists, titles, headings | Headings must be properly marked in the PowerPoint file. Nested tables and hidden slides are not supported. Speaker notes are not read in PPTX format, but can be included when converting to PDF. |
| XLSX/XLSM | Tables in Excel sheets supporting INT, FLOAT, and STRING values | Multiple table detection is not supported. Hidden tables, rows, or columns may affect detection. Additionally: headers must exist and be marked; references, scripts, etc. are not recognized. For multiple tables in one sheet: please create separate Excel files. |
- We also support SharePoint Pages and Sites. Please note that you can only select or deselect them all. SharePoint Forms and Lists are not supported.
- Images in PDFs and Word files: Images are supported and understood. Limitations exist if the quality of the images is low or their complexity is high. A quick way to test this is to upload the file to the personal space and ask a specific question that can only be answered from information in specific images.
Additional Limitations
- Maximum input file size for all file types: 20 MB
- We recommend converting files to PDF, as this improves readability, storage size, and loading time in the user interface. This is, however, not mandatory.
- Hyperlinks are often not recognized. If you want hyperlinks to be available, please write the links out.
- Tables in documents: Use a standard table layout and provide legends where possible. This is especially important for tables with special layouts or designs.
- SharePoint data cannot be selected on a document level. Place single items you want to select in folders.
- For a SharePoint site, Pages and DriveItems can only be selected all together or not at all.
Unsupported File Formats - Action Items:
- Videos: We recommend converting to transcripts with the most relevant content. Relevant presentation slides can be captured separately or included in the PDF file with the transcript.
- Program code in JSON files or similar formats: Cannot be processed yet. Individual code snippets can be inserted directly into the chat and processed by language models. However, entire codebases or files cannot be processed.
- Audio files: We recommend converting the most relevant parts to transcripts.
- Other file formats not listed in the table above are not compatible. Feel free to ask us for recommendations or feature requests.
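The file type and size limits above can be checked locally before a sync. The following is a minimal sketch and not part of the Genow platform. It assumes the extensions from the table of supported file formats, the 20 MB input limit with 1 MB counted as 1024 × 1024 bytes, and illustrative helper names (`check_file`, `check_folder`). It does not check page counts or PDF splitting.

```python
from pathlib import Path

# Extensions taken from the table of supported file formats.
SUPPORTED_EXTENSIONS = {".html", ".pdf", ".docx", ".pptx", ".xlsx", ".xlsm"}
# 20 MB input limit; assumes 1 MB = 1024 * 1024 bytes.
MAX_SIZE_BYTES = 20 * 1024 * 1024


def check_file(name: str, size_bytes: int) -> list[str]:
    """Return a list of problems for one file; an empty list means none found."""
    problems = []
    ext = Path(name).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type '{ext or 'none'}'")
    if size_bytes > MAX_SIZE_BYTES:
        size_mb = size_bytes / 1024 / 1024
        problems.append(f"larger than 20 MB ({size_mb:.1f} MB)")
    return problems


def check_folder(root: Path) -> dict[str, list[str]]:
    """Map each problematic file under root to its list of problems."""
    report = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            problems = check_file(path.name, path.stat().st_size)
            if problems:
                report[str(path)] = problems
    return report


print(check_file("handbook.pdf", 5 * 1024 * 1024))  # no problems expected
print(check_file("demo.mp4", 300 * 1024 * 1024))    # type and size problems
```

Calling `check_folder(Path("path/to/export"))` on a local copy of the data (the path is a placeholder) returns only the files that need attention, for example videos that should be converted to transcripts first.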
