Import data wizard for Azure Cognitive Search

The Azure portal provides an Import data wizard on the Azure Cognitive Search dashboard for prototyping and loading an index. This article covers the advantages and limitations of using the wizard, its inputs and outputs, and some usage information. For hands-on guidance in stepping through the wizard using built-in sample data, see the Create an Azure Cognitive Search index using the Azure portal quickstart.

Operations that this wizard performs include:

1 - Connect to a supported Azure data source.

2 - Create an index schema, inferred by sampling source data.

3 - Optionally, add AI enrichments to extract or generate content and structure.

4 - Run the wizard to create objects, import data, set a schedule, and set other configuration options.

The wizard outputs a number of objects that are saved to your search service, which you can access programmatically or in other tools.

Advantages and limitations

Before you write any code, you can use the wizard for prototyping and proof-of-concept testing. The wizard connects to external data sources, samples the data to create an initial index, and then imports the data as JSON documents into an index on Azure Cognitive Search.

Sampling is the process by which an index schema is inferred, and it has some limitations. When the data source is created, the wizard picks a sample of documents to decide what columns are part of the data source. Not all files are read, because reading them all could take hours for very large data sources. Given a selection of documents, source metadata, such as field name or type, is used to create a fields collection in an index schema. Depending on the complexity of the source data, you might need to edit the initial schema for accuracy, or extend it for completeness. You can make your changes inline on the index definition page.
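
The effect of sampling can be illustrated with a small Python sketch. This is not the wizard's actual logic: the `infer_fields` function, the `EDM_TYPES` mapping, and the sample hotel documents are all hypothetical, and the sketch simply assumes that the first few documents are the sample.

```python
from datetime import datetime
from itertools import islice

# Hypothetical mapping from Python value types to EDM type names.
EDM_TYPES = {
    bool: "Edm.Boolean",
    int: "Edm.Int64",
    float: "Edm.Double",
    str: "Edm.String",
    datetime: "Edm.DateTimeOffset",
}


def infer_fields(documents, sample_size=3):
    """Build a fields collection from the first `sample_size` documents."""
    fields = {}
    for doc in islice(documents, sample_size):
        for name, value in doc.items():
            if value is None or name in fields:
                continue  # keep the first type seen for each field
            # Unknown value types fall back to Edm.String (an assumption).
            fields[name] = EDM_TYPES.get(type(value), "Edm.String")
    return [{"name": n, "type": t} for n, t in fields.items()]


docs = [
    {"id": "1", "name": "Contoso Inn", "rating": 4.5},
    {"id": "2", "name": "Fabrikam Suites", "rating": 3.9},
    {"id": "3", "name": "Northwind Lodge", "rating": 4.1},
    {"id": "4", "name": "Tailspin Hotel", "rating": 4.0, "parking": True},
]

sampled = infer_fields(docs)                           # only sees three documents
complete = infer_fields(docs, sample_size=len(docs))   # sees all four
```

In this sketch, `sampled` has no `parking` field because the only document that contains it falls outside the sample. That is the kind of gap you would close by editing the schema.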

Overall, the advantages of using the wizard are clear: as long as requirements are met, you can prototype a queryable index within minutes. Some of the complexities of indexing, such as providing data as JSON documents, are handled by the wizard.

Known limitations are summarized as follows:

  • The wizard does not support iteration or reuse. Each pass through the wizard creates a new index, skillset, and indexer configuration. Only data sources can be persisted and reused within the wizard. To edit or refine other objects, you have to use the REST APIs or .NET SDK to retrieve and modify the structures.

  • Source content must reside in a supported Azure data source.

  • Sampling is over a subset of source data. For large data sources, it's possible for the wizard to miss fields. You might need to extend the schema, or correct the inferred data types, if sampling is insufficient.

  • AI enrichment, as exposed in the portal, is limited to a few built-in skills.

  • A knowledge store, which can be created by the wizard, is limited to a few default projections. If you want to save enriched documents created by the wizard, the blob container and tables come with default names and structure.
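
As a rough illustration of the first limitation, the following Python sketch shows one way to modify an index definition outside the wizard and prepare a REST request that updates it. The service name, index name, API key placeholder, and the `add_field` and `build_update_request` helpers are hypothetical, and the `2020-06-30` API version is an assumption. The request is built but not sent.

```python
import copy
import json
import urllib.request

API_VERSION = "2020-06-30"  # assumed API version; adjust for your service


def add_field(index_definition, field):
    """Return a copy of an index definition with one extra field."""
    updated = copy.deepcopy(index_definition)
    if any(f["name"] == field["name"] for f in updated["fields"]):
        raise ValueError(f"field {field['name']!r} already exists")
    updated["fields"].append(field)
    return updated


def build_update_request(service, api_key, index_definition):
    """Build (but do not send) a PUT request that updates the index."""
    url = (f"https://{service}.search.windows.net/indexes/"
           f"{index_definition['name']}?api-version={API_VERSION}")
    return urllib.request.Request(
        url,
        data=json.dumps(index_definition).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="PUT",
    )


# Hypothetical definition, as it might look after being retrieved from the service.
existing = {"name": "hotels-index",
            "fields": [{"name": "id", "type": "Edm.String", "key": True}]}

updated = add_field(existing, {"name": "parking", "type": "Edm.Boolean",
                               "filterable": True})
request = build_update_request("my-service", "<admin-api-key>", updated)
# urllib.request.urlopen(request) would send the update; it is omitted here.
```

Copying the definition before changing it keeps the retrieved version available for comparison if the update is rejected.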

Data source input

The Import data wizard connects to an external data source using the internal logic provided by Azure Cognitive Search indexers. Indexers are equipped to sample the source, read metadata, crack documents to read content and structure, and serialize contents as JSON for subsequent import to Azure Cognitive Search.

You can only import from a single table, database view, or equivalent data structure; however, the structure can include hierarchical or nested substructures. For more information, see How to model complex types.
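
To make the idea of nested substructures concrete, here is a minimal Python sketch that derives field definitions from one nested document. The `describe` function and the sample hotel document are hypothetical. The sketch assumes that nested objects map to `Edm.ComplexType`, that lists of objects map to `Collection(Edm.ComplexType)`, and that any other list holds strings.

```python
def describe(name, value):
    """Return a field definition for one value, recursing into nested data."""
    if isinstance(value, dict):
        return {"name": name, "type": "Edm.ComplexType",
                "fields": [describe(k, v) for k, v in value.items()]}
    if isinstance(value, list) and value and isinstance(value[0], dict):
        # Assumption: the first element is representative of the collection.
        return {"name": name, "type": "Collection(Edm.ComplexType)",
                "fields": [describe(k, v) for k, v in value[0].items()]}
    if isinstance(value, list):
        # Assumption: any other list is treated as a list of strings.
        return {"name": name, "type": "Collection(Edm.String)"}
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return {"name": name, "type": "Edm.Boolean"}
    if isinstance(value, int):
        return {"name": name, "type": "Edm.Int64"}
    if isinstance(value, float):
        return {"name": name, "type": "Edm.Double"}
    return {"name": name, "type": "Edm.String"}


hotel = {
    "HotelId": "1",
    "Address": {"City": "Seattle", "PostalCode": "98101"},
    "Rooms": [{"Type": "Suite", "BaseRate": 199.0}],
    "Tags": ["pool", "wifi"],
}

fields = [describe(name, value) for name, value in hotel.items()]
```

Here `Address` becomes a complex field with two subfields, and `Rooms` becomes a collection of complex values.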

You should create this single table or view before running the wizard, and it must contain content. For obvious reasons, it doesn't make sense to run the Import data wizard on an empty data source.

| Selection | Description |
|---|---|
| Existing data source | If you already have indexers defined in your search service, you might have an existing data source definition that you can reuse. In Azure Cognitive Search, data source objects are only used by indexers. You can create a data source object programmatically or through the Import data wizard, and reuse it as needed. |
| Samples | Azure Cognitive Search provides two built-in sample data sources that are used in tutorials and quickstarts: a real estate SQL database and a Hotels database hosted on Cosmos DB. For a walkthrough based on the Hotels sample, see the Create an index in the Azure portal quickstart. |
| Azure SQL Database | Service name, credentials for a database user with read permission, and a database name can be specified either on the page or via an ADO.NET connection string. Choose the connection string option to view or customize properties. The table or view that provides the rowset must be specified on the page. This option appears after the connection succeeds, giving a drop-down list so that you can make a selection. |
| SQL Server on Azure VM | Specify a fully qualified service name, user ID and password, and database as a connection string. To use this data source, you must have previously installed a certificate in the local store that encrypts the connection. For instructions, see SQL VM connection to Azure Cognitive Search. The table or view that provides the rowset must be specified on the page. This option appears after the connection succeeds, giving a drop-down list so that you can make a selection. |
| Azure Cosmos DB | Requirements include the account, database, and collection. All documents in the collection will be included in the index. You can define a query to flatten or filter the rowset, or leave the query blank. A query is not required in this wizard. |
| Azure Blob Storage | Requirements include the storage account and a container. Optionally, if blob names follow a virtual naming convention for grouping purposes, you can specify the virtual directory portion of the name as a folder under the container. See Indexing Blob Storage for more information. |
| Azure Table Storage | Requirements include the storage account and a table name. Optionally, you can specify a query to retrieve a subset of the table. See Indexing Table Storage for more information. |
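
The inputs above correspond roughly to the properties of a data source definition in the REST API. The Python sketch below builds such definitions as dictionaries. The `data_source` helper, the object names, and the connection string placeholders are hypothetical, and the set of type names is an assumption based on the sources listed here.

```python
# Assumed REST API type names for the data sources described above.
SUPPORTED_TYPES = {"azuresql", "cosmosdb", "azureblob", "azuretable"}


def data_source(name, source_type, connection_string, container, query=None):
    """Build a data source definition as a plain dictionary."""
    if source_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported data source type: {source_type}")
    definition = {
        "name": name,
        "type": source_type,
        "credentials": {"connectionString": connection_string},
        # "container" names the table, view, collection, or blob container.
        "container": {"name": container},
    }
    if query:
        # Optional: a query, or a virtual folder for blob storage.
        definition["container"]["query"] = query
    return definition


sql = data_source("hotels-sql", "azuresql",
                  "<ado.net-connection-string>", "HotelsView")
blob = data_source("docs-blob", "azureblob",
                   "<storage-connection-string>", "documents",
                   query="reports/2020")
```

Leaving `query` unset mirrors the wizard behavior of indexing the whole table, view, collection, or container.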

Wizard output

Behind the scenes, the wizard creates, configures, and invokes the following objects. After the wizard runs, you can find its output in the portal pages. The Overview page of your service has lists of indexes, indexers, data sources, and skillsets. Index definitions can be viewed in full JSON in the portal. For other definitions, you can use the REST API to GET specific objects.
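
As a sketch of retrieving a specific object, the following Python code builds a GET request for each object type. The service name, object names, and the `build_get_request` helper are hypothetical, and the `2020-06-30` API version is an assumption. The requests are constructed but not sent.

```python
import urllib.request

# URL path segment for each object type the wizard creates.
COLLECTIONS = {
    "index": "indexes",
    "indexer": "indexers",
    "data source": "datasources",
    "skillset": "skillsets",
}


def build_get_request(service, api_key, kind, name, api_version="2020-06-30"):
    """Build (but do not send) a GET request for one object definition."""
    url = (f"https://{service}.search.windows.net/"
           f"{COLLECTIONS[kind]}/{name}?api-version={api_version}")
    return urllib.request.Request(url, headers={"api-key": api_key},
                                  method="GET")


indexer_request = build_get_request("my-service", "<admin-api-key>",
                                    "indexer", "hotels-indexer")
skillset_request = build_get_request("my-service", "<admin-api-key>",
                                     "skillset", "hotels-skillset")
# urllib.request.urlopen(indexer_request) would return the JSON definition.
```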

| Object | Description |
|---|---|
| Data Source | Persists connection information to source data, including credentials. A data source object is used exclusively with indexers. |
| Index | Physical data structure used for full text search and other queries. |
| Skillset | A complete set of instructions for manipulating, transforming, and shaping content, including analyzing and extracting information from image files. Except for very simple and limited structures, it includes a reference to a Cognitive Services resource that provides enrichment. Optionally, it might also contain a knowledge store definition. |
| Indexer | A configuration object specifying a data source, target index, an optional skillset, optional schedule, and optional configuration settings for error handling and base-64 encoding. |
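
To show how these objects reference one another, here is a minimal Python sketch of an indexer definition. The `indexer_definition` helper and all object names are hypothetical. The sketch assumes that base-64 encoding of the key is expressed as a field mapping that uses the `base64Encode` mapping function, and that the schedule interval is an ISO 8601 duration.

```python
def indexer_definition(name, data_source, index, skillset=None, interval=None,
                       max_failed_items=0, key_source_field=None,
                       key_field=None):
    """Build an indexer definition that ties the other objects together."""
    definition = {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index,
        # Error handling: how many failed documents to tolerate.
        "parameters": {"maxFailedItems": max_failed_items},
    }
    if skillset:
        definition["skillsetName"] = skillset
    if interval:
        definition["schedule"] = {"interval": interval}  # for example, "PT2H"
    if key_source_field and key_field:
        # Base-64 encode the source value so that it is a valid document key.
        definition["fieldMappings"] = [{
            "sourceFieldName": key_source_field,
            "targetFieldName": key_field,
            "mappingFunction": {"name": "base64Encode"},
        }]
    return definition


indexer = indexer_definition(
    "blob-indexer", "docs-blob", "docs-index",
    interval="PT2H", max_failed_items=10,
    key_source_field="metadata_storage_path", key_field="id",
)
```

The skillset, schedule, and field mapping are left out of the definition unless they are supplied, which reflects their optional status.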

How to start the wizard

The Import data wizard is started from the command bar on the service Overview page.

  1. In the Azure portal, open the search service page from the dashboard or find your service in the service list.

  2. At the top of the service Overview page, click Import data.

    [Screenshot: Import data command in portal]

You can also launch Import data from other Azure services, including Azure Cosmos DB, Azure SQL Database, and Azure Blob storage. Look for Add Azure Cognitive Search in the left-navigation pane on the service overview page.

How to edit or finish an index schema in the wizard

The wizard generates an incomplete index, which will be populated with documents obtained from the input data source. For a functional index, make sure you have the following elements defined.

  1. Is the field list complete? Add new fields that sampling missed, and remove any that don't add value to a search experience or that won't be used in a filter expression or scoring profile.

  2. Is the data type appropriate for the incoming data? Azure Cognitive Search supports the entity data model (EDM) data types. For Azure SQL data, there is a mapping chart that lays out equivalent values. For more background, see Field mappings and transformations.

  3. Do you have one field that can serve as the key? This field must be Edm.String and it must uniquely identify a document. For relational data, it might be mapped to a primary key. For blobs, it might be the metadata-storage-path. If field values include spaces or dashes, you must set the Base-64 Encode Key option in the Create an Indexer step, under Advanced options, to suppress the validation check for these characters.

  4. Set attributes to determine how that field is used in an index.

    Take your time with this step because attributes determine the physical expression of fields in the index. If you want to change attributes later, even programmatically, you will almost always need to drop and rebuild the index. Core attributes like Searchable and Retrievable have a negligible impact on storage. Enabling filters and using suggesters increase storage requirements.

    • Searchable enables full-text search. Every field used in free form queries or in query expressions must have this attribute. Inverted indexes are created for each field that you mark as Searchable.

    • Retrievable returns the field in search results. Every field that provides content to search results must have this attribute. Setting this attribute does not appreciably affect index size.

    • Filterable allows the field to be referenced in filter expressions. Every field used in a $filter expression must have this attribute. Filter expressions are for exact matches. Because text strings remain intact, additional storage is required to accommodate the verbatim content.

    • Facetable enables the field for faceted navigation. Only fields also marked as Filterable can be marked as Facetable.

    • Sortable allows the field to be used in a sort. Every field used in an $orderby expression must have this attribute.

  5. Do you need lexical analysis? For Edm.String fields that are Searchable, you can set an Analyzer if you want language-enhanced indexing and querying.

    The default is Standard Lucene, but you can choose Microsoft English if you want to use Microsoft's analyzer for advanced lexical processing, such as resolving irregular noun and verb forms. Only language analyzers can be specified in the portal. Using a custom analyzer or a non-language analyzer like Keyword, Pattern, and so forth, must be done programmatically. For more information about analyzers, see Add language analyzers.

  6. Do you need typeahead functionality in the form of autocomplete or suggested results? Select the Suggester checkbox to enable typeahead query suggestions and autocomplete on selected fields. Suggesters add to the number of tokenized terms in your index, and thus consume more storage.
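
Several items in this checklist can be verified against an index definition before it is created. The Python sketch below is one hypothetical way to do that. The `check_index` function and the sample index are illustrative only, and the sketch assumes that every attribute is set explicitly on each field rather than left to service defaults.

```python
STRING_TYPES = {"Edm.String", "Collection(Edm.String)"}


def check_index(index):
    """Return a list of problems found in an index definition."""
    problems = []
    fields = index["fields"]

    # Item 3: one key field, and it must be Edm.String.
    keys = [f for f in fields if f.get("key")]
    if len(keys) != 1:
        problems.append("exactly one key field is required")
    elif keys[0]["type"] != "Edm.String":
        problems.append("the key field must be Edm.String")

    # Item 5: analyzers apply to searchable string fields.
    for f in fields:
        if f.get("analyzer") and not (f.get("searchable")
                                      and f["type"] in STRING_TYPES):
            problems.append(f"{f['name']}: analyzer needs a searchable string field")

    # Item 6: suggesters must reference fields that exist.
    names = {f["name"] for f in fields}
    for suggester in index.get("suggesters", []):
        for source in suggester["sourceFields"]:
            if source not in names:
                problems.append(f"suggester field {source!r} is not defined")
    return problems


index = {
    "name": "hotels-index",
    "fields": [
        {"name": "hotelId", "type": "Edm.String", "key": True,
         "retrievable": True},
        {"name": "description", "type": "Edm.String", "searchable": True,
         "retrievable": True, "analyzer": "en.microsoft"},
        {"name": "category", "type": "Edm.String", "filterable": True,
         "facetable": True, "retrievable": True},
        {"name": "rating", "type": "Edm.Double", "filterable": True,
         "sortable": True, "retrievable": True},
    ],
    "suggesters": [{"name": "sg", "searchMode": "analyzingInfixMatching",
                    "sourceFields": ["description"]}],
}

problems = check_index(index)
```

For the sample index, `check_index` reports no problems. Changing the type of `hotelId` to `Edm.Int32` would produce a key-type problem.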

Next steps

The best way to understand the benefits and limitations of the wizard is to step through it. The following quickstart guides you through each step.