This question is like asking “how long is a piece of string?”: there is no one-size-fits-all answer, because it depends on many competing factors.
This article outlines those factors to help you ensure that your feed can be processed by Spotler Activate Search. It also recommends maximums that keep processing speedy:
Overall Feed File Size
The limits for the overall feed file size are mostly about the transfer speed for the file, which can be adversely impacted by internet connectivity issues. Spotler Activate uses a nice fat pipe for data ingestion, but as there are systems between the customer and Sooqr on the public internet, those intermediate systems can potentially cause a feed download timeout.
We implement a 1 hour maximum for downloading a feed file. If the download exceeds this duration limit, then it will fail. The main reason for this duration limit is that Spotler Activate Search implements a fall-back for timed-out feed downloads that runs hourly to ensure that a feed download that has failed, potentially due to a temporary internet issue, can be attempted a second and third time, each attempt an hour apart.
Considering the above, we can guesstimate the maximum overall file size for different real-world transfer speeds (in megabytes per second):
- 1 MB/s → 3.6 GB [recommended maximum]
- 10 MB/s → 36 GB
- 100 MB/s → 360 GB
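The figures above follow from multiplying the transfer rate by the 3,600 seconds in the 1 hour download window. A minimal sketch of that arithmetic, assuming (as the numbers in the list imply) that the rates are in megabytes per second and that 1 GB is 1000 MB. The function names are illustrative and are not part of any Spotler Activate Search API:

```python
DOWNLOAD_WINDOW_SECONDS = 60 * 60  # the 1 hour download limit described above


def max_feed_size_gb(rate_mb_per_s: float) -> float:
    """Largest file, in GB (1 GB = 1000 MB), that downloads within the window."""
    return rate_mb_per_s * DOWNLOAD_WINDOW_SECONDS / 1000


def download_minutes(file_size_gb: float, rate_mb_per_s: float) -> float:
    """Estimated download time in minutes at a sustained transfer rate."""
    return file_size_gb * 1000 / rate_mb_per_s / 60


if __name__ == "__main__":
    for rate in (1, 10, 100):
        print(f"{rate} MB/s -> {max_feed_size_gb(rate):g} GB")
```

Real transfers rarely sustain their peak rate for a full hour, so treat these values as upper bounds rather than targets.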
Recommendation:
As Sooqr needs time to actually process the feed items after downloading the feed, you should strive to keep the feed as small as possible. Alternatively, break the feed up into separate files that can each be scheduled, spreading the update over a longer period.
How large can a Feed Item be?
A feed contains many items, usually products or posts, that are processed by Spotler Activate Search and stored for search. Each feed is split by Spotler Activate Search into ‘batches’ of 1,000 items, which are then processed together.
The number of feed items that Spotler Activate Search can process in a single batch is determined by the size of the feed items. Spotler Activate Search administrators can tune the number of items in a batch for feeds with excessively large items. [Callout: Contact Spotler Activate Search admin here if you feel your feed is failing due to chonky items] Because admins can tune the batch size, the maximum item size is a sliding scale:
- 1000 item batch [standard] = 2 MB per item [recommended maximum]
- 500 item batch = 4 MB per item
- 100 item batch = 20 MB per item
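Each row of the sliding scale multiplies out to roughly the same total (1000 × 2 MB, 500 × 4 MB, 100 × 20 MB), so the scale can be sketched as a fixed per-batch budget divided by the batch size. The 2000 MB budget below is inferred from those rows and is an assumption, not a documented Spotler Activate Search setting. The function names are hypothetical:

```python
BATCH_BUDGET_MB = 2000  # assumption: inferred from 1000 x 2 MB, 500 x 4 MB, 100 x 20 MB
LISTED_BATCH_SIZES = (1000, 500, 100)  # the batch sizes listed above, largest first


def max_item_size_mb(batch_size: int) -> float:
    """Largest item size (MB) implied by the sliding scale for a given batch size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return BATCH_BUDGET_MB / batch_size


def largest_listed_batch(item_size_mb: float):
    """Return the largest listed batch size that accommodates the item, or None."""
    for batch_size in LISTED_BATCH_SIZES:
        if item_size_mb <= max_item_size_mb(batch_size):
            return batch_size
    return None


if __name__ == "__main__":
    for size in (1.5, 3, 25):
        print(f"{size} MB item -> batch size {largest_listed_batch(size)}")
```

An item that fits none of the listed batch sizes is a signal to trim the item rather than to ask for an even smaller batch.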
Recommendation:
The duration of processing is directly related to the size of each Feed Item. It is recommended to make them as small as possible to ensure that processing can complete within an hour.
How large can field content be?
Spotler Activate Search also has constraints on the size of individual feed elements, which depend on the type of data they contain. For numeric types, the maximums follow IEEE and standard computing conventions for 32-bit binary representation.
- Floats: 32-bit IEEE floating point
- Integers: 32-bit signed integer
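To check numeric values before they go into a feed, the two limits above can be tested with the Python standard library. This is a sketch only: the function names are hypothetical, and how Spotler Activate Search treats out-of-range values is not described in this article.

```python
import math
import struct

INT32_MIN = -2**31       # -2,147,483,648
INT32_MAX = 2**31 - 1    #  2,147,483,647


def fits_int32(value: int) -> bool:
    """True when value is representable as a 32-bit signed integer."""
    return INT32_MIN <= value <= INT32_MAX


def fits_float32(value: float) -> bool:
    """True when value is finite and within the IEEE 754 single-precision range."""
    if not math.isfinite(value):
        return False
    try:
        struct.pack("<f", value)  # raises OverflowError when too large for 32 bits
    except OverflowError:
        return False
    return True


def float32_roundtrip(value: float) -> float:
    """Return value as it would read back after being stored in 32 bits."""
    return struct.unpack("<f", struct.pack("<f", value))[0]
```

Note that a value can be in range and still lose precision: `float32_roundtrip(16777217.0)` returns `16777216.0`, because single precision only represents every integer exactly up to 2**24.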
For text fields, it depends on the purpose of the text. If the element contains values used in filtering or sorting, it has different limits than text used solely in text search. There is, technically, a hard limit for searchable text, but it exceeds the Feed Item limit. As such, ensure that searchable text does not exceed the Feed Item size limit.
Note: you must also consider that all text is in UTF-8 format, which is a variable-length encoding. In the following, ‘N characters’ means N Unicode code points, each of which UTF-8 encodes as 1 to 4 bytes. A code point may or may not be an actual character glyph; it can also be what Unicode calls a ‘combining mark’.
- Filter/Sort Values: 32 kB, 3200 characters
- Text Search: constrained by Feed Item size; 3200 characters recommended
Recommendation:
For consistency and ease of implementation, it is recommended to keep all text fields within a 3200 character maximum. This allows the content to be searchable and also to function effectively as sort or filter values, if needed.
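Because UTF-8 is variable-length, the character count and the byte count of a field can differ. The sketch below measures both for a field value, assuming ‘characters’ means Unicode code points (which is what Python's `len` counts on a `str`) and that 32 kB means 32,000 bytes. Both are assumptions, and the function names are hypothetical:

```python
MAX_CODE_POINTS = 3200
MAX_FILTER_BYTES = 32 * 1000  # assumption: kB taken as 1000 bytes


def text_field_stats(text: str):
    """Return (code points, UTF-8 bytes) for a field value."""
    return len(text), len(text.encode("utf-8"))


def check_text_field(text: str):
    """Return a list of human-readable problems; an empty list means OK."""
    code_points, utf8_bytes = text_field_stats(text)
    problems = []
    if code_points > MAX_CODE_POINTS:
        problems.append(f"{code_points} code points exceeds {MAX_CODE_POINTS}")
    if utf8_bytes > MAX_FILTER_BYTES:
        problems.append(f"{utf8_bytes} bytes exceeds {MAX_FILTER_BYTES}")
    return problems


if __name__ == "__main__":
    # "e" followed by a combining acute accent: one glyph, two code points.
    print(text_field_stats("e\u0301"))
```

A code point never needs more than 4 bytes in UTF-8, so a field that stays within 3200 code points occupies at most 12,800 bytes and also stays within the byte limit. In practice the code point check is the one that matters.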
General Advice for Feed Content
All of the above discusses the maximums; however, search performance can be severely impacted if the maximums are reached. This section describes best practices for feed data to ensure that search performance remains speedy:
- Exclude boilerplate text
- Make searchable content as short as you are able and as specific to the individual item as possible
- Exclude HTML markup
- Make Related Items its own multi valued text field
Exclude boilerplate text
Boilerplate text that is the exact same on multiple items should be excluded. Although useful on a website to assist customer understanding, it significantly reduces the relevancy of search for a specific item using words that are shared across items.
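One way to find candidate boilerplate before building the feed is to look for sentences that repeat verbatim across many items. The sketch below is a simple heuristic, not something Spotler Activate Search does for you: the sentence splitting is naive, and the `min_share` threshold and function names are illustrative assumptions.

```python
import re
from collections import Counter

SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")  # naive split after ., ! or ?


def split_sentences(text: str):
    """Split text into trimmed, non-empty sentences."""
    return [s.strip() for s in SENTENCE_BREAK.split(text) if s.strip()]


def shared_sentences(descriptions, min_share: float = 0.5):
    """Sentences that appear verbatim in at least min_share of the descriptions."""
    counts = Counter()
    for text in descriptions:
        counts.update(set(split_sentences(text)))  # count each sentence once per item
    threshold = min_share * len(descriptions)
    return {s for s, n in counts.items() if n > 1 and n >= threshold}


def strip_boilerplate(text: str, boilerplate) -> str:
    """Remove the given boilerplate sentences from one description."""
    return " ".join(s for s in split_sentences(text) if s not in boilerplate)


if __name__ == "__main__":
    items = [
        "Red wool scarf. Free shipping on all orders.",
        "Blue cotton hat. Free shipping on all orders.",
        "Green silk tie. Free shipping on all orders.",
    ]
    common = shared_sentences(items)
    print([strip_boilerplate(text, common) for text in items])
```

Review the detected sentences by hand before removing them, since a sentence shared by a genuine product family may still be worth keeping.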
Make searchable content as short as you are able and as specific to the individual item as possible
To help boost relevancy, ensure that your item data is as specific to the item and as uniquely identifying as possible. Relevancy is also impacted by the length of the content, thanks to the underlying TF-IDF search algorithm. The punchier the text is, the more relevant it will be for search.
Exclude HTML markup
Although Spotler Activate Search does a ‘best effort’ attempt to sanitize the feed for
Make Related Items its own multi valued text field
Rather than using a single large text field for all of the related items, such as in a product family, put them into a multivalued text field such as the following:
<related_items>
<node>Item 1</node>
<node>Item 2</node>
<node>Item 3</node>
</related_items>
As the relevance score is impacted by field length, a single match in a massive text field will be significantly less relevant than a single match in a multivalued text field with the related items within a single sub-element. This splitting up of the data also ensures that the maximum Text field size mentioned above is not exceeded.
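If the feed is generated by a script, a multivalued element like the one above can be built with Python's standard `xml.etree.ElementTree` module. This is a minimal sketch: the element names `related_items` and `node` are taken from the example above, while the helper function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def related_items_element(names):
    """Build a <related_items> element with one <node> child per related item."""
    parent = ET.Element("related_items")
    for name in names:
        child = ET.SubElement(parent, "node")
        child.text = name  # special characters such as & are escaped on output
    return parent


if __name__ == "__main__":
    element = related_items_element(["Item 1", "Item 2", "Item 3"])
    print(ET.tostring(element, encoding="unicode"))
```

Letting the XML library write the text also takes care of escaping, which hand-built strings tend to get wrong.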