Data Maintenance - Genow Helpcenter

Data governance is essential when using a knowledge retrieval platform like Genow. Features such as the feedback dashboard and quality insights help you with data governance while you use the platform.

In this article, you will find out which data types we support and how to improve your data quality.

The Genow platform’s ability to extract information from documents and data depends on the quality of the underlying data and documents. For this reason, a good data structure must be created. This documentation provides basic guidelines and shows which data can be extracted.

Even after the test phase, Agent admins should maintain and supplement data. Users can provide feedback on missing or incorrect information.

How Data Is Extracted

When data is first synchronized, it is processed to make it available for generating responses.

During this process, data and information from files are analyzed and interpreted to understand their structure and convert them into a suitable format (technical term: parsing). We combine various methods, some developed in-house and others from external service providers such as Google.
We then apply vectorization. This technology ‘understands’ the meaning of your text content and translates it into unique digital codes (‘addresses’). When you start a search or ask a question, our system can immediately find the most relevant sections in your knowledge base or documents, without manual searching.
Processing data incurs costs. Find out more here:

Introduction & Parsing

Data Synchronisation
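
To make the idea of vectorization more concrete, here is a minimal sketch in Python. It is not Genow's implementation: the `embed` function is a deliberately simple stand-in (word counts instead of a learned language model), and the function names and sample sections are hypothetical. It only illustrates the principle that text is turned into vectors and that a query is matched to the most similar section.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (real systems use learned models)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, sections: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k sections that are most similar to the query."""
    q = embed(query)
    ranked = sorted(sections, key=lambda s: cosine(q, embed(s)), reverse=True)
    return ranked[:top_k]


# Hypothetical knowledge-base sections, for illustration only.
sections = [
    "Vacation requests must be submitted two weeks in advance.",
    "The VPN client is installed from the software portal.",
    "Travel expenses are reimbursed within 30 days.",
]
print(search("How do I install the VPN client?", sections))
```

In this toy example the query shares the words "the", "VPN", and "client" with the second section only, so that section is returned. A production system compares meaning rather than exact words, which is why well-structured, unambiguous source text matters.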

Which Data Sources Are Supported?

We support automated data extraction with data synchronisation from the following sources:

SharePoint
Google Drive
Jira
Confluence
Google Cloud Buckets
Coming soon: ServiceNow

Is your data stored in an external system? No problem! We can develop a custom connector. Contact us! Find out about options to further develop and customise your Agent here.

Special Notes for Pipeline Optimisations

There are various optimisation options for data pipelines which, if implemented correctly, can improve quality and user satisfaction. You can find a brief explanation of selected optimisation options and settings under Agent Optimisations and Settings.

Data Maintenance - General Guidelines

Area	Good (Do)	Bad (Don’t)	Why It Matters for AI
Structure	Logical folder structure, clear hierarchy	Too deep/flat structures, vague folder names (“Miscellaneous”)	Context understanding, faster finding of relevant data
Filenames	Descriptive, meaningful, consistent	Nondescript (Doc1.pdf), unclear (final_final.docx)	Content indication, relevance assessment
Currency	Remove/archive/label outdated content, regular review	Keep many old versions in working area	Avoid outdated responses
Relevance	Only relevant data in access area, archive/remove irrelevant	Mix private/irrelevant files	Faster search, avoid inappropriate results, unnecessary costs
Duplicates	Central storage, use links instead of copies	Store same file in multiple locations	Uniqueness, consistency of responses
Maintenance	Regular process, continuous improvement	One-time action, then neglect	Long-term maintenance of data quality and AI performance

Note: The less irrelevant information your documents contain, the more efficient Genow’s responses will be. Please optimize your data using these guidelines.
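
Two of the guidelines above (filenames and duplicates) can be checked automatically before a sync. The following Python sketch is one possible way to do this on a local copy of a folder. The pattern for "nondescript" names and the function names are assumptions made for illustration, not part of Genow; adapt them to your own naming conventions.

```python
import hashlib
import re
from collections import defaultdict
from pathlib import Path

# Assumed pattern for nondescript names such as Doc1.pdf or final_final.docx.
VAGUE_NAME = re.compile(
    r"^(doc|document|new|untitled|scan|final)[\d_\- ]*(final)?[\d_\- ]*$",
    re.IGNORECASE,
)


def is_vague(filename: str) -> bool:
    """True if the file name (without extension) matches the assumed pattern."""
    return bool(VAGUE_NAME.match(Path(filename).stem))


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group files under root whose contents are byte-identical."""
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

For example, `is_vague("Doc1.pdf")` and `is_vague("final_final.docx")` return `True`, while a descriptive name such as `Onboarding_Checklist_2024.pdf` returns `False`. `find_duplicates(Path("my_export"))` lists groups of identical files that could be replaced by one central copy and links.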

Supported Data Types

Supported File Formats

File Type	Detected Elements	Limitations and Recommendations
HTML	Paragraph, table, list, title, heading, header, footer	Parsing heavily relies on HTML tags. CSS-based formatting may not be captured.
PDF	Paragraph, table, title, heading, header, footer	Tables spanning multiple pages may be split into two tables. As a general limit, a PDF should have no more than 500 pages and 40 MB, ideally no more than 20 MB. During the sync process, PDFs that are too large are split into sub-documents automatically. We recommend placing information where it belongs in the body text rather than in footnotes.
DOCX	Paragraphs, tables across multiple pages, lists, titles, headings	Nested tables are not supported. For FAQ documents: if your document covers many different topics in very small text sections, consider using an (Excel) table instead of continuous text.
PPTX	Paragraphs, tables, lists, titles, headings	Headings must be properly marked in the PowerPoint file. Nested tables and hidden slides are not supported. Speaker notes are not read in PPTX format, but can be included when converting to PDF.
XLSX/XLSM	Tables in Excel sheets supporting INT, FLOAT, and STRING values	Multiple table detection is not supported. Hidden tables, rows, or columns may affect detection. Additionally, headers must exist and be marked; references, scripts, etc. are not recognized. For multiple tables in one sheet, please create separate Excel files. The maximum number of cells is 5 million.
ODT	same as DOCX (ODT is the OpenDocument counterpart)	same as DOCX
ODP	same as PPTX (ODP is the OpenDocument counterpart)	same as PPTX
ODS	same as XLSX (ODS is the OpenDocument counterpart)	same as XLSX

Additions:

We also support SharePoint Pages and Sites. Please note that you can only select or deselect them all. SharePoint Forms and Lists are not supported.
Images in PDFs and Word files: Images are supported and understood. Limitations exist if the quality of the images is low or their complexity is high. A quick way to test this is to upload the file to the personal space and ask a specific question that can only be answered from information in specific images.

Additional Limitations

Maximum input file size for all file types: 20 MB
We recommend converting files to PDF, as this improves readability, storage size, and loading time in the user interface. This is, however, not mandatory.
Hyperlinks are often not recognized. If you want hyperlinks to be available, please write the links out.
Tables in documents: use a standard table layout and provide legends. This is especially important if your tables have special layouts or designs.
SharePoint data cannot be selected at the document level. Place individual items you want to select in folders.
SharePoint Pages and DriveItems can be selected either all or none for a SharePoint site.
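
The format and size limits above lend themselves to a simple check before files are added to a synced folder. The Python sketch below is an illustration only: the function name is hypothetical, the extension list is copied from the table of supported file formats, and the 20 MB limit is interpreted as 20 × 1024 × 1024 bytes, which is an assumption. It applies the general 20 MB limit to all files and does not model the automatic splitting of large PDFs.

```python
from pathlib import Path

# Extensions taken from the table of supported file formats.
SUPPORTED = {".html", ".pdf", ".docx", ".pptx", ".xlsx", ".xlsm", ".odt", ".odp", ".ods"}
MAX_BYTES = 20 * 1024 * 1024  # 20 MB; assumes 1 MB = 1024 * 1024 bytes


def check_file(name: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file passes both checks."""
    problems = []
    ext = Path(name).suffix.lower()
    if ext not in SUPPORTED:
        problems.append(f"unsupported file type: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        size_mb = size_bytes / 1024 / 1024
        problems.append(f"file is {size_mb:.1f} MB, above the 20 MB limit")
    return problems


print(check_file("handbook.pdf", 5 * 1024 * 1024))
print(check_file("demo.mp4", 30 * 1024 * 1024))
```

The first call prints an empty list. The second reports both an unsupported file type and a size above the limit, which matches the recommendation to convert videos to transcripts instead.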

Unsupported File Formats - Action Items:

Videos: We recommend converting to transcripts with the most relevant content. Relevant presentation slides can be captured separately or included in the PDF file with the transcript.
Program code (in JSON files or similar formats): Cannot be processed yet. Individual code snippets can be inserted directly into the chat and processed by language models. However, entire codebases or files cannot be processed.
Audio files: We recommend converting the most relevant parts to transcripts.
Other file formats not listed in the table above are not compatible. Feel free to ask us for recommendations or feature requests.

Special Considerations for the Personal Space

As an Agent admin, you can allow users to combine their personal assets with the Agent knowledge. A personal asset is visible only to the user who created it.

Personal Assets have additional guidelines regarding processed file sizes. The size of information in a Personal Asset is specified in “chunks”. Chunks are text segments into which documents are divided so that our AI model can efficiently access and use relevant information from large documents. The actual number of pages that can be processed depends on text size, layout, etc. A Personal Asset has a maximum size of 1000 chunks of data, which corresponds to approximately 1000-2000 pages of text. The same requirements regarding file formats apply (see table above): all supported file types listed there are also supported in the personal space. If you have a Personal Asset that might be interesting for others as well, you can propose converting it to an actual Agent, where there are almost no storage limits.
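
To get a rough feeling for the chunk limit, the following Python sketch splits text into fixed-size word chunks and checks the total against the 1000-chunk maximum. How Genow actually forms chunks is not described here, so the chunk size of 500 words and the function names are purely illustrative assumptions; the real count depends on text size, layout, and the platform's own chunking.

```python
MAX_CHUNKS = 1000   # maximum stated above for a Personal Asset
CHUNK_WORDS = 500   # assumption: illustrative chunk size, not Genow's actual setting


def split_into_chunks(text: str, chunk_words: int = CHUNK_WORDS) -> list[str]:
    """Split text into consecutive segments of at most chunk_words words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]


def fits_personal_asset(texts: list[str]) -> bool:
    """True if all documents together stay within the chunk limit."""
    total = sum(len(split_into_chunks(t)) for t in texts)
    return total <= MAX_CHUNKS
```

With these assumed values, `fits_personal_asset` returns `True` for up to 500,000 words in total and `False` beyond that. Treat the result as an estimate, not as a guarantee that an upload will succeed.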
